
Re: [Xen-devel] [PATCHv2 0/3] Implement per-cpu reader-writer locks

To: Malcolm Crossley <malcolm.crossley@xxxxxxxxxx>, <JBeulich@xxxxxxxx>, <ian.campbell@xxxxxxxxxx>, <andrew.cooper3@xxxxxxxxxx>, <Marcos.Matsunaga@xxxxxxxxxx>, <keir@xxxxxxx>, <konrad.wilk@xxxxxxxxxx>, <george.dunlap@xxxxxxxxxxxxx>
From: George Dunlap <george.dunlap@xxxxxxxxxx>
Date: Tue, 24 Nov 2015 18:30:53 +0000
Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, stefano.stabellini@xxxxxxxxxx
Delivery-date: Tue, 24 Nov 2015 18:31:17 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 24/11/15 18:16, George Dunlap wrote:
> On 20/11/15 16:03, Malcolm Crossley wrote:
>> This patch series adds per-cpu reader-writer locks as a generic lock
>> implementation and then converts the grant table and p2m rwlocks to
>> use the percpu rwlocks, in order to improve multi-socket host performance.
>>
>> CPU profiling has revealed the rwlocks themselves suffer from severe cache
>> line bouncing due to the cmpxchg operation used even when taking a read lock.
>> Multiqueue paravirtualised I/O results in heavy contention of the grant table
>> and p2m read locks of a specific domain and so I/O throughput is bottlenecked
>> by the overhead of the cache line bouncing itself.
>>
>> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
>> area to record a CPU has taken the read lock. Correctness is enforced for
>> the write lock by using a per lock barrier which forces the per-cpu read
>> lock to revert to using a standard read lock. The write lock then polls
>> all the percpu data area until active readers for the lock have exited.
>>
>> Removing the cache line bouncing on a multi-socket Haswell-EP system
>> dramatically improves performance, with 16 vCPU network IO performance going
>> from 15 gb/s to 64 gb/s! The host under test was fully utilising all 40
>> logical CPUs at 64 gb/s, so a bigger logical CPU host may see an even better
>> IO improvement.
> 
> Impressive -- thanks for doing this work.
> 
> One question: Your description here sounds like you've tested with a
> single large domain, but what happens with multiple domains?
> 
> It looks like the "per-cpu-rwlock" is shared by *all* locks of a
> particular type (e.g., all domains share the per-cpu p2m rwlock).
> (Correct me if I'm wrong here.)

Sorry, looking in more detail at the code, it seems I am wrong.  The
fast-path stores which "slow" lock has been grabbed in the per-cpu
variable; so the writer only needs to wait for readers that have grabbed
the particular lock it's interested in.  So the scenarios I outline
below shouldn't really be issues.
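
To make that concrete, here is a minimal sketch in C of how I read the
scheme.  It is not the code from the patch: NR_CPUS, the percpu_owner
array, the struct layout and every function name below are hypothetical,
the CPU number is passed in explicitly instead of coming from real
per-cpu data, and the fall-back to the ordinary ("slow") rwlock is only
signalled by a return value rather than modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4  /* hypothetical fixed CPU count, for illustration only */

/* Hypothetical lock type; the ordinary rwlock it would wrap is omitted. */
typedef struct percpu_rwlock {
    atomic_bool writer_activating;  /* set while a writer wants this lock */
} percpu_rwlock_t;

/* One slot per CPU: which lock (if any) that CPU holds via the fast path. */
static _Atomic(percpu_rwlock_t *) percpu_owner[NR_CPUS];

/*
 * Reader fast path: publish the lock being read in this CPU's own slot.
 * Returns false if a writer is active, in which case the caller would
 * have to fall back to the ordinary rwlock (not modelled here).
 */
static bool percpu_read_trylock_fast(unsigned int cpu, percpu_rwlock_t *l)
{
    assert(atomic_load(&percpu_owner[cpu]) == NULL);  /* no nesting here */
    atomic_store(&percpu_owner[cpu], l);
    if (atomic_load(&l->writer_activating)) {
        /* Back out so the writer does not wait for us. */
        atomic_store(&percpu_owner[cpu], NULL);
        return false;
    }
    return true;
}

static void percpu_read_unlock_fast(unsigned int cpu, percpu_rwlock_t *l)
{
    assert(atomic_load(&percpu_owner[cpu]) == l);
    atomic_store(&percpu_owner[cpu], NULL);
}

/* Count the CPUs whose slot names *this* lock; other locks are ignored. */
static unsigned int readers_of(percpu_rwlock_t *l)
{
    unsigned int n = 0;

    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        if (atomic_load(&percpu_owner[cpu]) == l)
            n++;
    return n;
}

/*
 * Writer side: raise the barrier so new readers leave the fast path, then
 * report how many fast-path readers remain.  A real writer would keep
 * polling until this reaches zero.
 */
static unsigned int percpu_write_begin(percpu_rwlock_t *l)
{
    atomic_store(&l->writer_activating, true);
    return readers_of(l);
}
```

With two locks, say one per domain, a reader on CPU 0 holding the first
lock is invisible to a writer of the second: percpu_write_begin() on the
second lock only counts CPUs whose slot points at that lock.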

The description of the algorithm in the changelog could do with a bit
more detail. :-)

 -George


