[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] x86: use POPCNT for hweight<N>() when available


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Wed, 20 May 2020 19:18:29 +0200
  • Authentication-results: esa6.hc3370-68.iphmx.com; dkim=none (message not signed) header.i=none
  • Cc: "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Delivery-date: Wed, 20 May 2020 17:18:57 +0000
  • Ironport-sdr: uweaNdcDgBW1fhKufj7EQpRybnowEQbPy/Dd5z0s3f7GtfpVb3JEo+OWOm9+ff292dZyeXCG7M OSX3MZNsqFxMv+o8LQm1+kboYdlOSvaBoJpJslzvbH8avixXjnKbOSb/Q7RiKCJT2WSl/aAbfs 89460la65joHaGxszw58o1TXjDEBCKcOQhaMvTt+xMeSYdLChf4uLPaX8DYHggTBQXsVFBt509 c4cEJsqd8KMZqqwFA1TBHeb7P94LfJ5uIomgNzWu5u/P2NFLYIj0iEttZ66BdmgZWUCX9zGk3N bW4=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Wed, May 20, 2020 at 03:12:25PM +0200, Jan Beulich wrote:
> On 20.05.2020 13:43, Roger Pau Monné wrote:
> > On Wed, May 20, 2020 at 12:57:27PM +0200, Jan Beulich wrote:
> >> On 20.05.2020 12:28, Roger Pau Monné wrote:
> >>> On Wed, May 20, 2020 at 12:17:15PM +0200, Jan Beulich wrote:
> >>>> On 20.05.2020 11:31, Roger Pau Monné wrote:
> >>>>> On Wed, May 20, 2020 at 10:31:38AM +0200, Jan Beulich wrote:
> >>>>>> On 14.05.2020 16:05, Roger Pau Monné wrote:
> >>>>>>> On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote:
> >>>>>>>> @@ -251,6 +255,10 @@ boot/mkelf32: boot/mkelf32.c
> >>>>>>>>   efi/mkreloc: efi/mkreloc.c
> >>>>>>>>      $(HOSTCC) $(HOSTCFLAGS) -g -o $@ $<
> >>>>>>>>   
> >>>>>>>> +nocov-y += hweight.o
> >>>>>>>> +noubsan-y += hweight.o
> >>>>>>>> +hweight.o: CFLAGS += $(foreach reg,cx dx si 8 9 10 
> >>>>>>>> 11,-ffixed-r$(reg))
> >>>>>>>
> >>>>>>> Why not use clobbers in the asm to list the scratch registers? Is it
> >>>>>>> that much expensive?
> >>>>>>
> >>>>>> The goal is to disturb the call sites as little as possible. There's
> >>>>>> no point avoiding the scratch registers when no call is made (i.e.
> >>>>>> when the POPCNT insn can be used). Taking away from the compiler 7
> >>>>>> out of 15 registers (that it can hold live data in) seems quite a
> >>>>>> lot to me.
> >>>>>
> >>>>> IMO using -ffixed-reg for all those registers is even worse, as that
> >>>>> prevents the generated code in hweight from using any of those, thus
> >>>>> greatly limiting the amount of registers and likely making the
> >>>>> generated code rely heavily on pushing an popping from the stack?
> >>>>
> >>>> Okay, that's the other side of the idea behind all this: Virtually no
> >>>> hardware we run on will lack POPCNT support, hence the quality of
> >>>> these fallback routines matters only the very old hardware, where we
> >>>> likely don't perform optimally already anyway.
> >>>>
> >>>>> This also has the side effect to limiting the usage of popcnt to gcc,
> >>>>> which IMO is also not ideal.
> >>>>
> >>>> Agreed. I don't know enough about clang to be able to think of
> >>>> possible alternatives. In any event there's no change to current
> >>>> behavior for hypervisors built with clang.
> >>>>
> >>>>> I also wondered, since the in-place asm before patching is a call
> >>>>> instruction, wouldn't inline asm at build time already assume that the
> >>>>> scratch registers are clobbered?
> >>>>
> >>>> That would imply the compiler peeks into the string literal of the
> >>>> asm(). At least gcc doesn't, and even if it did it couldn't infer an
> >>>> ABI from seeing a CALL insn.
> >>>
> >>> Please bear with me, but then I don't understand what Linux is doing
> >>> in arch/x86/include/asm/arch_hweight.h. I see no clobbers there,
> >>> neither it seems like the __sw_hweight{32/64} functions are built
> >>> without the usage of the scratch registers.
> >>
> >> __sw_hweight{32,64} are implemented in assembly, avoiding most
> >> scratch registers while pushing/popping the ones which do get
> >> altered.
> > 
> > Oh right, I was looking at lib/hweight.c instead of the arch one.
> > 
> > Would you agree to use the no_caller_saved_registers attribute (which
> > is available AFAICT for both gcc and clang) for generic_hweightXX and
> > then remove the asm prefix code in favour of the defines for
> > hweight{8/16}?
> 
> At least for gcc no_caller_saved_registers isn't old enough to be
> used unconditionally (nor is its companion -mgeneral-regs-only).
> If you tell me it's fine to use unconditionally with clang, then
> I can see about making this the preferred variant, with the
> present one as a fallback.

Hm, so my suggestion was bad, no_caller_saved_registers is only
implemented starting clang 5, which is newer than the minimum we
currently require (3.5).

So apart from adding a clobber to the asm instance covering the
scratch registers the only option I can see as viable is using a bunch
of dummy global variables assigned to the registers we need to prevent
the software generic_hweightXX functions from using, but that's ugly
and will likely trigger warnings at least, since I'm not sure the
compiler will find it safe to clobber a function call register with a
global variable. Or adding a prolegue / epilogue to the call
instruction in order to push / pop the relevant registers. None seems
like a very good option IMO.

I also assume that using no_caller_saved_registers when available or
else keeping the current behavior is not an acceptable solution? FWIW,
from a FreeBSD PoV I would be OK with that, as I don't think there are
any supported targets with clang < 5.

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.