Re: [PATCH v5 11/23] xen/riscv: introduce cmpxchg.h
On 07.03.2024 11:35, Oleksii wrote:
> On Wed, 2024-03-06 at 15:56 +0100, Jan Beulich wrote:
>> On 26.02.2024 18:38, Oleksii Kurochko wrote:
>>> The header was taken from Linux kernel 6.4.0-rc1.
>>>
>>> Additionally, the following were updated:
>>> * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
>>> access.
>>> * replace tabs with spaces
>>> * replace __* variables with *__
>>> * introduce generic versions of xchg_* and cmpxchg_*.
>>>
>>> The implementation of the 4- and 8-byte cases was left as it is done
>>> in the Linux kernel, in accordance with the RISC-V spec:
>>> ```
>>> Table A.5 (only part of the table was copied here)
>>>
>>> Linux Construct      RVWMO Mapping
>>> atomic <op> relaxed  amo<op>.{w|d}
>>> atomic <op> acquire  amo<op>.{w|d}.aq
>>> atomic <op> release  amo<op>.{w|d}.rl
>>> atomic <op>          amo<op>.{w|d}.aqrl
>>>
>>> Linux Construct      RVWMO LR/SC Mapping
>>> atomic <op> relaxed  loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
>>> atomic <op> acquire  loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
>>> atomic <op> release  loop: lr.{w|d}; <op>; sc.{w|d}.aqrl*; bnez loop
>>>                      OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}*; bnez loop
>>> atomic <op>          loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop
>>>
>>> The Linux mappings for release operations may seem stronger than
>>> necessary, but these mappings are needed to cover some cases in which
>>> Linux requires stronger orderings than the more intuitive mappings
>>> would provide. In particular, as of the time this text is being
>>> written, Linux is actively debating whether to require load-load,
>>> load-store, and store-store orderings between accesses in one
>>> critical section and accesses in a subsequent critical section in the
>>> same hart and protected by the same synchronization object. Not all
>>> combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl mappings
>>> combine to provide such orderings.
>>> There are a few ways around this problem, including:
>>> 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
>>>    suffices but is undesirable, as it defeats the purpose of the
>>>    aq/rl modifiers.
>>> 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does
>>>    not currently work due to the lack of load and store opcodes with
>>>    aq and rl modifiers.
>>
>> As before I don't understand this point. Can you give an example of
>> what sort of opcode / instruction is missing?
> If I understand the spec correctly then l{b|h|w|d} and s{b|h|w|d}
> instructions don't have aq or rl annotation.
How would load insns other than LR and store insns other than SC come
into play here?
>>> 3. Strengthen the mappings of release operations such that they
>>>    would enforce sufficient orderings in the presence of either type
>>>    of acquire mapping. This is the currently-recommended solution,
>>>    and the one shown in Table A.5.
>>> ```
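(As an aside: a minimal sketch, assuming hypothetical helper names and
showing only the 32-bit form, of how the AMO rows of Table A.5 come out
as GCC inline assembly. This is an illustration, not the patch's code.)

```
#include <stdint.h>

/* Illustration of the Table A.5 AMO rows for a 32-bit swap (sketch). */
static inline uint32_t xchg32_relaxed(volatile uint32_t *p, uint32_t v)
{
    uint32_t old;

    /* "atomic <op> relaxed" -> amo<op>.w, no ordering suffix. */
    asm volatile ( "amoswap.w %0, %2, %1"
                   : "=r" (old), "+A" (*p)
                   : "r" (v)
                   : "memory" );

    return old;
}

static inline uint32_t xchg32_full(volatile uint32_t *p, uint32_t v)
{
    uint32_t old;

    /* Fully ordered "atomic <op>" -> amo<op>.w.aqrl. */
    asm volatile ( "amoswap.w.aqrl %0, %2, %1"
                   : "=r" (old), "+A" (*p)
                   : "r" (v)
                   : "memory" );

    return old;
}
```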
>>>
>>> But in the Linux kernel, atomics were strengthened with fences:
>>> ```
>>> Atomics present the same issue with locking: release and acquire
>>> variants need to be strengthened to meet the constraints defined
>>> by the Linux-kernel memory consistency model [1].
>>>
>>> Atomics present a further issue: implementations of atomics such
>>> as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
>>> which do not give full-ordering with .aqrl; for example, current
>>> implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
>>> below to end up with the state indicated in the "exists" clause.
>>>
>>> In order to "synchronize" LKMM and RISC-V's implementation, this
>>> commit strengthens the implementations of the atomics operations
>>> by replacing .rl and .aq with the use of ("lightweight") fences,
>>> and by replacing .aqrl LR/SC pairs in sequences such as:
>>>
>>> 0:      lr.w.aqrl   %0, %addr
>>>         bne         %0, %old, 1f
>>>         ...
>>>         sc.w.aqrl   %1, %new, %addr
>>>         bnez        %1, 0b
>>> 1:
>>>
>>> with sequences of the form:
>>>
>>> 0:      lr.w        %0, %addr
>>>         bne         %0, %old, 1f
>>>         ...
>>>         sc.w.rl     %1, %new, %addr  /* SC-release   */
>>>         bnez        %1, 0b
>>>         fence       rw, rw           /* "full" fence */
>>> 1:
>>>
>>> following Daniel's suggestion.
>>>
>>> These modifications were validated with simulation of the RISC-V
>>> memory consistency model.
>>>
>>> C lr-sc-aqrl-pair-vs-full-barrier
>>>
>>> {}
>>>
>>> P0(int *x, int *y, atomic_t *u)
>>> {
>>>     int r0;
>>>     int r1;
>>>
>>>     WRITE_ONCE(*x, 1);
>>>     r0 = atomic_cmpxchg(u, 0, 1);
>>>     r1 = READ_ONCE(*y);
>>> }
>>>
>>> P1(int *x, int *y, atomic_t *v)
>>> {
>>>     int r0;
>>>     int r1;
>>>
>>>     WRITE_ONCE(*y, 1);
>>>     r0 = atomic_cmpxchg(v, 0, 1);
>>>     r1 = READ_ONCE(*x);
>>> }
>>>
>>> exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
>>> ```
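(For concreteness, a hedged sketch of the strengthened sequence quoted
above, expressed as a 32-bit cmpxchg in GCC inline assembly; the
function name is hypothetical and this is not the patch's code.)

```
#include <stdint.h>

/* Sketch of the lr.w / sc.w.rl / fence rw,rw sequence quoted above. */
static inline uint32_t cmpxchg32_fenced(volatile uint32_t *p,
                                        uint32_t old, uint32_t newv)
{
    uint32_t prev, rc;

    asm volatile ( "0: lr.w %0, %2\n"
                   "   bne %0, %z3, 1f\n"
                   "   sc.w.rl %1, %z4, %2\n" /* SC-release */
                   "   bnez %1, 0b\n"
                   "   fence rw, rw\n"        /* "full" fence */
                   "1:\n"
                   : "=&r" (prev), "=&r" (rc), "+A" (*p)
                   : "rJ" (old), "rJ" (newv)
                   : "memory" );

    return prev;
}
```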
>>
>> While I'm entirely willing to trust this can happen, I can't bring
>> this in line with the A extension spec.
>>
>> Additionally it's not clear to me to what extent all of this applies
>> when you don't really use LR/SC in the 4- and 8-byte cases (and going
>> forward likely also not in the 1- and 2-byte case, utilizing Zabha
>> when available).
> It just explains how fences, lr/sc, amoswap, and the .aq/.rl
> annotations can be combined, and why the combinations introduced in
> this patch are used.
Except that I don't understand that explanation, iow why said combination
of values could be observed even when using suffixes properly.
>>> +    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 - sizeof(*ptr))) * BITS_PER_BYTE; \
>>
>> Why uint8_t?
> It is enough to cover the possible start bit positions of the value
> that should be updated, so I decided to use uint8_t.
Please take a look at the "Types" section in ./CODING_STYLE.
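(For reference, a minimal sketch of the computation under discussion,
using plain unsigned int as ./CODING_STYLE prefers; the helper names
are made up for illustration only.)

```
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_BYTE 8 /* normally coming from Xen's headers */

/*
 * Bit offset of a naturally aligned 1- or 2-byte value within its
 * enclosing 32-bit word: at most 24, so unsigned int is plenty.
 */
static inline unsigned int xchg_small_bitpos(const volatile void *ptr,
                                             size_t size)
{
    return ((unsigned long)ptr & (0x4 - size)) * BITS_PER_BYTE;
}

/* Mask covering those bytes within the 32-bit container. */
static inline uint32_t xchg_small_mask(const volatile void *ptr,
                                       size_t size)
{
    return (((uint32_t)1 << (size * BITS_PER_BYTE)) - 1)
           << xchg_small_bitpos(ptr, size);
}
```

E.g. a 2-byte value at an address ending in 0x2 yields position 16 and
mask 0xffff0000.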
>>> +    {                                                           \
>>> +    case 1:                                                     \
>>> +    case 2:                                                     \
>>> +        ret__ = emulate_xchg_1_2(ptr, new__, sfx, pre, post);   \
>>> +        break;                                                  \
>>> +    case 4:                                                     \
>>> +        __amoswap_generic(ptr, new__, ret__,\
>>> +                          ".w" sfx,  pre, post); \
>>> +        break;                                                  \
>>> +    case 8:                                                     \
>>> +        __amoswap_generic(ptr, new__, ret__,\
>>> +                          ".d" sfx,  pre, post); \
>>> +        break;                                                  \
>>
>> In io.h you make sure to avoid rv64-only insns. Here you don't. The
>> build would fail either way, but this still looks inconsistent.
>>
>> Also nit: Stray double blanks (twice) ahead of "pre". Plus with this
>> style of line continuation you want to consistently have exactly one
>> blank ahead of each backslash.
>>
>>> +    default:                                                    \
>>> +        STATIC_ASSERT_UNREACHABLE();                            \
>>> +    }                                                           \
>>> +    ret__;                                                      \
>>> +})
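(On the io.h consistency remark: one hypothetical way to compile the
rv64-only case out, assuming CONFIG_RISCV_64 is the applicable guard.
A sketch, not the patch.)

```
/* Make the rv64-only amoswap.d path build-time unreachable on rv32. */
#ifdef CONFIG_RISCV_64
# define __amoswap_8(ptr, new, ret, sfx, pre, post) \
    __amoswap_generic(ptr, new, ret, ".d" sfx, pre, post)
#else
# define __amoswap_8(ptr, new, ret, sfx, pre, post) \
    STATIC_ASSERT_UNREACHABLE()
#endif
```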
>>> +
>>> +#define xchg_relaxed(ptr, x)                                    \
>>> +({                                                              \
>>> +    __typeof__(*(ptr)) x_ = (x);                                \
>>
>> What is the purpose of this, when __xchg_generic() already does this
>> same type conversion?
>>
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \
>>> +})
>>> +
>>> +#define xchg_acquire(ptr, x)                                    \
>>> +({                                                              \
>>> +    __typeof__(*(ptr)) x_ = (x);                                \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
>>> +                                       "", "", RISCV_ACQUIRE_BARRIER); \
>>> +})
>>> +
>>> +#define xchg_release(ptr, x)                                    \
>>> +({                                                              \
>>> +    __typeof__(*(ptr)) x_ = (x);                                \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\
>>> +                                       "", RISCV_RELEASE_BARRIER, ""); \
>>> +})
>>
>> As asked before: Are there going to be any uses of these three?
>> Common code doesn't require them. And not needing to provide them
>> would simplify things quite a bit, it seems.
> I checked my private branches and it looks to me that I introduced them
> only for the corresponding atomic operations (which were copied from
> the Linux kernel), which are also not used.
>
> So we could definitely drop these macros for now, but should
> __xchg_generic() be updated as well? Looking at:
> #define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
>                                     ".aqrl", "", "")
> the last two arguments become unneeded, but I wanted to leave them in
> case someone needs to bring back xchg_{release, acquire, ...}. Does
> that make any sense?
It all depends on how it's justified in the description.
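(If the three variants are indeed dropped, a hypothetical simplified
shape, sketched here rather than proposed:)

```
/*
 * Sketch: with xchg_{relaxed,acquire,release}() gone, only the fully
 * ordered form remains and the pre/post barrier parameters disappear.
 */
#define xchg(ptr, x) \
    ((__typeof__(*(ptr)))__xchg_generic(ptr, (unsigned long)(x), \
                                        sizeof(*(ptr)), ".aqrl"))
```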
>>> +#define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
>>> +                                    ".aqrl", "", "")
>>
>> According to the earlier comment (where I don't follow the example
>> given), is .aqrl sufficient here? And even if it was for the 4- and
>> 8-byte cases, is it sufficient in the 1- and 2-byte emulation case
>> (where it then is appended to just the SC)?
> If I understand your question correctly then, according to the spec,
> .aqrl is enough for amo<op>.{w|d} instructions:
>
> Linux Construct      RVWMO AMO Mapping
> atomic <op> relaxed  amo<op>.{w|d}
> atomic <op> acquire  amo<op>.{w|d}.aq
> atomic <op> release  amo<op>.{w|d}.rl
> atomic <op>          amo<op>.{w|d}.aqrl
>
> but in the lr/sc case you are right, sc requires a suffix too:
>
> Linux Construct      RVWMO LR/SC Mapping
> atomic <op> relaxed  loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
> atomic <op> acquire  loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
> atomic <op> release  loop: lr.{w|d}; <op>; sc.{w|d}.aqrl*; bnez loop
>                      OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}*; bnez loop
> atomic <op>          loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop
>
> I will add sc_sfx to emulate_xchg_1_2(). The only remaining question is
> whether __xchg_generic(ptr, new, size, sfx, pre, post) should be
> changed to __xchg_generic(ptr, new, size, sfx1, sfx2, pre, post) to
> cover both cases: amo<op>.{w|d}.sfx1 and lr.{w|d}.sfx1 ... sc.{w|d}.sfx2?
I expect that's going to be necessary. In the end you'll see what's needed
when making the code adjustment.
Jan
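(A hypothetical sketch of what a two-suffix emulate_xchg_1_2() might
look like; names, constraints, and details are assumptions rather than
the final code.)

```
/*
 * Sketch: emulate a 1-/2-byte xchg via a 4-byte LR/SC loop, with
 * separate LR and SC suffixes so the Table A.5 LR/SC mappings can be
 * expressed. Assumes *(ptr) is naturally aligned.
 */
#define emulate_xchg_1_2(ptr, new, lr_sfx, sc_sfx)                         \
({                                                                         \
    uint32_t *aligned_ptr =                                                \
        (uint32_t *)((unsigned long)(ptr) & ~(0x4 - sizeof(*(ptr))));      \
    unsigned int new_val_pos =                                             \
        ((unsigned long)(ptr) & (0x4 - sizeof(*(ptr)))) * BITS_PER_BYTE;   \
    unsigned long mask =                                                   \
        ((1UL << (sizeof(*(ptr)) * BITS_PER_BYTE)) - 1UL) << new_val_pos;  \
    unsigned int new_ = (unsigned int)(new) << new_val_pos;                \
    unsigned int old_val, scratch;                                         \
                                                                           \
    asm volatile ( "0: lr.w" lr_sfx " %0, %2\n"                            \
                   "   and %1, %0, %z3\n" /* clear the value's bytes */    \
                   "   or %1, %1, %z4\n"  /* insert the new value */       \
                   "   sc.w" sc_sfx " %1, %1, %2\n"                        \
                   "   bnez %1, 0b\n"                                      \
                   : "=&r" (old_val), "=&r" (scratch), "+A" (*aligned_ptr) \
                   : "rJ" (~mask), "rJ" (new_)                             \
                   : "memory" );                                           \
                                                                           \
    (__typeof__(*(ptr)))((old_val & mask) >> new_val_pos);                 \
})
```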