Go: cmd/compile: Faster divisibility check for constants

Created on 17 Feb 2019  路  19Comments  路  Source: golang/go

What version of Go are you using (go version)?

1.11

What operating system and processor architecture are you using (go env)?

amd64

What did you do?

Please consider the following code which tests whether the given number is divisible by 19:

func Divisible19(x uint64) bool {
    return x%19 == 0
}

What did you expect to see?

It could be rewritten as:

func Divisible19(x uint64) bool {
    return x*0x86bca1af286bca1b <= 0x0d79435e50d79435
}

which compiles to:

MOVQ    "".X+8(SP), AX
MOVQ    $-8737931403336103397, CX
IMULQ   CX, AX
MOVQ    $970881267037344821, CX
CMPQ    AX, CX

What did you see instead?

Instead, the original code compiles to:

MOVQ    $-5825287602224068931, AX
MOVQ    "".X+8(SP), CX
MULQ    CX
ADDQ    CX, DX
RCRQ    $1, DX
SHRQ    $4, DX
LEAQ    (DX)(DX*8), BX
LEAQ    (DX)(BX*2), DX
CMPQ    CX, DX

Similar situation happens for other constants as well.

FrozenDueToAge NeedsInvestigation Performance

Most helpful comment

It鈥檚 easy enough to condition generic rules based on architecture, and they have minimal compiler performance impact.

It鈥檚 also pretty straightforward to check how many 32 bit divides there are in std; I vaguely recall that there are more than one would suspect.

In short, I鈥檇 be interested in going down this road at least a little. Particularly if #15806 got fixed along the way. :) But I鈥檒l defer to Keith in the end.

All 19 comments

Dup of https://github.com/golang/go/issues/15806? I started on that but got distracted. My initial implementation was broken and I never returned to it. Feel free to pick it up.

cc @randall77 as FYI

@josharian Interestingly, @bmkessler has a library implementing the above paper.

Nice!

@bmkessler, I鈥檇 be happy to help you find your way around if you鈥檇 like to contribute that to the compiler. It might also be helpful in a few parts of the runtime, IIRC.

Yeah, I thought about opening an issue and submitting the changes for that library. The benefit for that method is ~25% on 32-bit divisions/mods by constants for amd64, which I don't think is a very common case.

The main complication for submitting this change, is that the rewrite rules are at the generic level currently e.g. https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/gen/generic.rules#L957-L981 and following which are only broken out by register size and useHmul. The speed up of Lemire's method depends on the relative difference in using Hmul64 versus the additional shifts/average in the current method. So if the generic rewrite rules are going to be modified, you'd need to test on other architectures. I haven't tested any other architecture, but Lemire notes that 32-bit division and mod are slower on arm and power using that method. One benefit to Lemire's method is that the rule is considerably simpler than the Granlund-Montgomery method because there is just a single case that doesn't depend on the constant divisor. However, there would also be two methods, since the current method would continue to be employed for all the other divisions, so I am not sure if the extra complexity is worth it.

It鈥檚 easy enough to condition generic rules based on architecture, and they have minimal compiler performance impact.

It鈥檚 also pretty straightforward to check how many 32 bit divides there are in std; I vaguely recall that there are more than one would suspect.

In short, I鈥檇 be interested in going down this road at least a little. Particularly if #15806 got fixed along the way. :) But I鈥檒l defer to Keith in the end.

I'm ok with all of this if someone wants to implement it, and it really is faster.
My only demand is that we need good comments (like the ones in magic.go) and good tests.

I implemented just the division portion of the rules based on Lemire's method and uint32 is faster on my amd64 machine, but int32 is slower.

name              old time/op  new time/op  delta
DivconstI32-4     1.52ns 卤 0%  1.77ns 卤 0%  +16.71%  (p=0.016 n=4+5)
DivconstU32-4     1.95ns 卤 0%  1.38ns 卤 0%  -29.09%  (p=0.000 n=5+4)

I can push that change. I didn't add the mod rule because I wasn't sure how to handle

q, r = x/d, x%d

currently the rules rewrite that as the equivalent of

q, r = x/d, x - (x/d)*d

So the x/d can be used in both calculations. If a different mod rule is put in place, there will be duplicate calculation for this case (three multiplications instead of two), but standalone mods would be faster as they are just two multiplications and no subtraction. So I am not sure if a stand alone mod rule is useful or if the case of division and mod can be handled some straightforward way. Just adding this div rule already speeds up the mod decently though.

ModconstU32-4     2.42ns 卤 0%  2.07ns 卤 0%  -14.46%  (p=0.008 n=5+5)

Thank you for looking into this! But, and I apologize if I misunderstood, the issue is about divisibility check by constant (i.e. x%d == 0), rather than division by constant (i.e. x%d). The former seem to be simpler and for small d (perhaps even as large as uint32?) it should be possible to test uint64%d == 0 with just one multiplication and compare, as shown at vert first comment. But again, maybe I misunderstood?

Sorry, I should have been more clear. It is possible to implement x%d == 0 in a single multiplication and compare as you mention for unsigned odd divisors of any size. For even divisors, there is an additional rotate required in the Granlund-Montgomery method and signed divisors require an additional add constant in both cases. (Hacker's Delight 10-17 is a better expanded reference than the note in the original paper).

As an initial update, I started to implement all of the changes from the Lemire paper for division, modulus and divisibility check on 32-bit integers, which are faster and don't involve any separate cases depending on the divisor (same magic numbers), but wasn't sure how to proceed for mod and divisibility since the current rules already rewrite Mod into a division. It looks like @josharian ran into the same issue on #15806 where he ended up using matching against the already expanded form of the Mod. Maybe he can help sort out how best to set up the rules and we can update the divisibility check for all the sizes of integers.

I鈥檓 afk for a bit. But...would something along the lines of https://go-review.googlesource.com/c/go/+/129379 work for you?

I suspect it is often the case that when you do x%c == 0, you'll also be doing x/c. So I think we need some way to determine you're not doing the latter before we optimize the former.

It might be a good idea to grovel through some Go code to see how correct my suspicion is. If it is 99% correct, the x%c==0 optimization is ~useless. If it is only 10% correct, maybe we just do the optimization regardless. If it is in between, we might need to detect the divide somehow.

Do x/c and x%c==0 use the results of the same multiplies? Those are the most expensive sub-ops, so if they are shared that might be good enough.

@randall77 I know about this, which is x%c == 0 without the division.

@josharian Thanks! Disabling the Mod rule on the opt pass allowed the divisibility rule to be applied. @randall77 I miscounted instructions earlier, the multiplies use different constants (modular inverse for the divisibility check), but even when both x/d and x%d==0 are needed that is still only two multiplies in the old and new code. I benchmarked the divisibility check in both cases and the pure check x%d == 0 is about 35% faster on amd64 for uint64 and the case with both x/d and x%d == 0 is only 3% slower, so it seems like the divisibility check is a worthwhile optimization as long as the cases where we need both x/d and x%d == 0 are < 92% of the time.

name                     old time/op  new time/op  delta
DivisibleconstU64-4      3.00ns 卤 0%  1.92ns 卤 0%  -35.96%  (p=0.000 n=5+4)
DivisibleWDivconstU64-4  3.46ns 卤 1%  3.56ns 卤 1%   +2.89%  (p=0.008 n=5+5)

In addition to the isPrime above, the only other raw uses of numeric constants for x%n == 0 other than n a power of two in the standard library seem to be runtime and time/isLeap neither of which have a division with the same constant. There are a three uses of x%d != 0 as well.

I can push something up for review after fleshing out the tests and comments.

Change https://golang.org/cl/168037 mentions this issue: cmd/compile: add unsigned divisibility rules

Change https://golang.org/cl/173998 mentions this issue: cmd/compile: add signed divisibility rules

Thank you!

Was this page helpful?
0 / 5 - 0 ratings