
Re: Memory ordering question in the shutdown deferral code



Hi,

On 23/09/2020 23:57, Stefano Stabellini wrote:
On Mon, 21 Sep 2020, Julien Grall wrote:
On 21/09/2020 13:55, Durrant, Paul wrote:
(+ Xen-devel)

Sorry, I forgot to CC xen-devel.

On 21/09/2020 12:38, Julien Grall wrote:
Hi all,

I have started to look at the deferral code (see
vcpu_start_shutdown_deferral()) because we need it for LiveUpdate and
Arm will soon use it.

The current implementation uses an smp_mb() to ensure ordering
between a write and a subsequent read. The code looks roughly like this
(I have slightly adapted it to make my question more obvious):

domain_shutdown()
       d->is_shutting_down = 1;
       smp_mb();
       if ( !vcpu0->defer_shutdown )
       {
         vcpu_pause_nosync(vcpu0);
         vcpu0->paused_for_shutdown = 1;
       }

vcpu_start_shutdown_deferral()
       vcpu0->defer_shutdown = 1;
       smp_mb();
       if ( unlikely(d->is_shutting_down) )
         vcpu_check_shutdown(vcpu0);

       return vcpu0->defer_shutdown;

smp_mb() only guarantees ordering (it may be stronger on some
architectures), so I think there is a race between the two functions.

It would be possible to pause the vCPU in domain_shutdown() because the
write to vcpu0->defer_shutdown wasn't yet visible.

Equally, vcpu_start_shutdown_deferral() may not see d->is_shutting_down
and therefore Xen may continue to send the I/O. Yet the vCPU will be
paused so the I/O will never complete.
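
To spell out the interleaving I am worried about, here is a hypothetical
trace (with A standing for d->is_shutting_down and B for
vcpu0->defer_shutdown, both initially 0):

   /* CPU0: domain_shutdown()       CPU1: vcpu_start_shutdown_deferral()
    *
    * A = 1;                        B = 1;
    * smp_mb();                     smp_mb();
    * reads B, observes 0           reads A, observes 0
    * => pauses the vCPU            => returns 1, so I/O carries on
    *
    * The vCPU would be paused with I/O outstanding, so the I/O would
    * never complete.
    */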


The barrier enforces global order, right?

It is not clear to me what you mean by "global ordering". This seems to
suggest a very expensive synchronization barrier between all the processors.

From an arch-agnostic PoV, smp_mb() enforces an ordering between
loads/stores, but it doesn't guarantee *when* they will be observed.

So, if domain_shutdown() pauses the vCPU, then is_shutting_down must
necessarily be visible to all CPUs.

That's not the guarantee provided by smp_mb() (see above).


I simplified the code further to help us reason about it:


      thread#1  |  thread#2
                |
   1) WRITE A   |  WRITE B
   2) BARRIER   |  BARRIER
   3) READ B    |  READ A
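
For anyone who wants to experiment, below is a minimal self-contained
C11 version of this pattern. This is my own sketch, not Xen code:
atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb(), and
the fences forbid the outcome r0 == 0 && r1 == 0:

   #include <pthread.h>
   #include <stdatomic.h>
   #include <stdio.h>

   static atomic_int A, B;
   static int r0, r1;

   static void *thread1(void *arg)
   {
       atomic_store_explicit(&A, 1, memory_order_relaxed);  /* 1) WRITE A */
       atomic_thread_fence(memory_order_seq_cst);           /* 2) BARRIER */
       r0 = atomic_load_explicit(&B, memory_order_relaxed); /* 3) READ B */
       return NULL;
   }

   static void *thread2(void *arg)
   {
       atomic_store_explicit(&B, 1, memory_order_relaxed);  /* 1) WRITE B */
       atomic_thread_fence(memory_order_seq_cst);           /* 2) BARRIER */
       r1 = atomic_load_explicit(&A, memory_order_relaxed); /* 3) READ A */
       return NULL;
   }

   int main(void)
   {
       pthread_t t1, t2;

       pthread_create(&t1, NULL, thread1, NULL);
       pthread_create(&t2, NULL, thread2, NULL);
       pthread_join(t1, NULL);
       pthread_join(t2, NULL);

       /* With the fences in place, r0 == 0 && r1 == 0 can never happen. */
       printf("r0=%d r1=%d\n", r0, r1);
       return 0;
   }

Run it in a loop (compiled with e.g. gcc -pthread) and you will never
see both reads return 0; remove the two fences and, sooner or later,
you will.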

Thank you for writing a simpler example. It allowed me to find a litmus test (see [1]) and now I understand why it works. See more below.



I think it is (theoretically) possible for thread#1 to be at 1) and
about to do 2), while thread#2 goes ahead and does 1) 2) 3).

Well, it is not that theoretical :). There are many situations where this can happen. To cite only a few:
    - the threads may run on the same pCPU
    - the pCPU running a thread may get interrupted
    - the data modified may not be in the L1 cache, so there will be a delay to access it

By the time
thread#1 does 2), thread#2 has already completed the entire sequence.

If thread#2 has already done 2), and thread#1 is about to do 3), then I
think we are guaranteed that thread#1 will see the new value of B. Or
is this the core of the issue we are discussing?

No, you are right. I got confused because smp_mb() doesn't guarantee when a write/read completes.

So I blindly assumed that the ordering would be enforced just at the processor level. Instead, the ordering is enforced at the Inner Shareable level (i.e. between all the processors), as we are using dmb ish.
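
For reference, on Arm Xen implements smp_mb() along these lines
(paraphrased from xen/include/asm-arm/system.h, so check the tree for
the exact definitions):

   /* Data Memory Barrier with a configurable domain/access-type option. */
   #define dmb(opt)  asm volatile("dmb " #opt : : : "memory")

   /* Full barrier over the Inner Shareable domain, i.e. all the
    * processors Xen and the guests run on. */
   #define smp_mb()  dmb(ish)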

Assuming A and B are initialized to 0 and both threads write 1, there is no way for both threads to read 0. In that case, the existing shutdown code is fine.
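
For completeness, the test at [1] encodes exactly this store-buffering
pattern. Abridged from memory (see the link for the exact file), it
looks like the sketch below; herd7 reports the "exists" outcome as
Never:

   C SB+fencembonceonces

   {}

   P0(int *x, int *y)
   {
           int r0;

           WRITE_ONCE(*x, 1);
           smp_mb();
           r0 = READ_ONCE(*y);
   }

   P1(int *x, int *y)
   {
           int r0;

           WRITE_ONCE(*y, 1);
           smp_mb();
           r0 = READ_ONCE(*x);
   }

   exists (0:r0=0 /\ 1:r0=0)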

Cheers,

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/litmus-tests/SB+fencembonceonces.litmus

--
Julien Grall



 

