go 🚀 - math/bits: add extended precision Add, Sub, Mul, Div

amd64 also has these instructions, and access to them is one of the major blockers to writing performant crypto in Go instead of assembly.

I think the best place for them would be math/bits. Can you give us a preview of the exported APIs?

/cc @vkrasnov @gtank (they probably have opinions)

FiloSottile on 12 Apr 2018

👍2

/cc @mundaym @TocarIP too, perhaps

mvdan on 12 Apr 2018

MulWW - this is same as math/big.mulWW
AddQQ - adds two uint128. Each uint128 is represented as a pair of uint64 (hi, lo) returns uint128(hi,lo)
AddQW - adds a uint128 and uint64 returns uint128(hi, lo)
ShrQ44 - shiftright uint128 with constant 44 returns uint64
ShrQ42 - shiftright uint128 with constant 42 returns uint64
The arm64 assembly for these APIs are in https://go-review.googlesource.com/c/go/+/106376/3/src/vendor/golang_org/x/crypto/poly1305/asm_arm64.s
ShrQ44 and ShrQ42 can be combined into one API I think.

bmakam-qdt on 12 Apr 2018

This is useful, no doubt. Intel can definitely benefit too. I'd put them all in math/big. The ShrQXX should probably be generic?

vkrasnov on 12 Apr 2018

👍1

arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.

bmakam-qdt on 12 Apr 2018

I don't like the idea of polluting math/big. It already has a million godoc entries, but at least they all deal in Int/Float/Rat, while these would split a fixed size uint128 in hi/lo uint64.

FiloSottile on 12 Apr 2018

👍6

arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.

Intel should have SHRD.

I don't like the idea of polluting math/big

Maybe create something new? like math/uint128

vkrasnov on 12 Apr 2018

Is there any chance of writing this in regular Go code without any extra APIs, and having the compiler recognise the pattern to optimise?

mvdan on 12 Apr 2018

👍1

are you suggesting a builtin [u]int128 type like how GCC does: https://godbolt.org/g/o6ka5e ?

bmakam-qdt on 12 Apr 2018

@mvdan Not really. The high bits of uint64 * uint64 are simply discarded. We would need uint128 in the language. (#9455)

FiloSottile on 12 Apr 2018

That is probably too big a change for a very niche use case. I think that less than 0.1% of Go projects would need these.

vkrasnov on 12 Apr 2018

If anywhere these would probably belong in math/bits, not math/big. I think some of them were even mooted during the original math/bits API discussion. (On phone so don’t have issue link handy.)

josharian on 12 Apr 2018

👍5

math/bits seems like the right place.
I think you only need the following building blocks:

func Mul(x, y uint64) (uint64, uint64) // 64x64 -> 128 multiply (low, then high)
func Add(x, y uint64) (uint64, uint64) // 64x64 -> 65 add (low, then high)

(and maybe signed versions?)
Everything else can be written in Go.

Even Add is not really necessary, but might make it easier for the compiler to detect the carry. (In Go, the carry is low_result < low_x || low_result < low_y or something.)

The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)

randall77 on 12 Apr 2018

I think AddQQ and AddQW have uint128 inputs, not uint64.

Agreed on not being useful enough to change the language. I don't hate math/uint128, but if we can boil them down to a couple API in math/bits and detecting some patterns, that's better.

FiloSottile on 12 Apr 2018

If we have 64x64 -> 128, we might probably also have 32x32 -> 64 on 32-bit architectures. On 64-bit architectures this would be trivial.

cherrymui on 12 Apr 2018

Can't you already get a 32x32->64 multiply on 32-bit archs? 386 can, at least:

func f(x, y uint32) (uint32, uint32) {
    z := uint64(x) * uint64(y)
    return uint32(z), uint32(z >> 32)
}

compiles to

    0x0012 00018 (/home/khr/go/tmp1.go:3)   MOVL    "".y+8(SP), AX
    0x0016 00022 (/home/khr/go/tmp1.go:3)   MOVL    "".x+4(SP), CX
    0x001a 00026 (/home/khr/go/tmp1.go:4)   MULL    CX
    0x001c 00028 (/home/khr/go/tmp1.go:4)   MOVL    AX, "".~r2+12(SP)
    0x0020 00032 (/home/khr/go/tmp1.go:4)   MOVL    DX, "".~r3+16(SP)

randall77 on 12 Apr 2018

32x32 -> 64 can be handled by pattern matching thanks to uint64. This is about filling the uint128 void.

FiloSottile on 12 Apr 2018

You are right. 32x32->64 can already be handled.

cherrymui on 12 Apr 2018

previous conversation about uint128: https://github.com/golang/go/issues/9455

ericlagergren on 12 Apr 2018

I like an idea about exposing mulWW and divWW, which we already intrinsify, for general use, but I'm not sure about right package. IIRC there was Gophercon talk about optimizing 64x64->128 multiplication, so this looks like something useful for others.
This should also help hash/fnv128, which currently does ugly stuff like this:

                 s1l := (s[1] & 0xffffffff) * prime128Lower
                 s1h := (s[1] >> 32) * prime128Lower
                 s0l := (s[0]&0xffffffff)*prime128Lower + (s[1]&0xffffffff)<<prime128Shift
                 s0h := (s[0]>>32)*prime128Lower + (s[1]>>32)<<prime128Shift
                 // Carries
                 s1h += s1l >> 32
                 s0l += s1h >> 32
                 s0h += s0l >> 32
                 // Update the values
                 s[1] = (s1l & 0xffffffff) + (s1h << 32)
                 s[0] = (s0l & 0xffffffff) + (s0h << 32)

TocarIP on 12 Apr 2018

The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)

I think we might not need the intrinsic for multiword shiftright because we can create a pattern that can be detected by the compiler, something like this:
ShrQconst(42, z[5], z[4]) -> (z5<<22 | z4>>42)

I think AddQQ and AddQW have uint128 inputs, not uint64.

Yes, AddQQ takes two uint128 i.e two pairs of uint64 and outputs the result to uint128(hi, lo) and AddQW takes a uint128 and uint64 and outputs the result to uint128(hi, lo).

bmakam-qdt on 12 Apr 2018

I am strongly in favor of doing this in some form.

If it's going to be explicit functions, then math/bits or a dedicated math/{intrinsic, uint128, multiword} is a better place for it than math/big, since we care a lot about the details of the underlying representation here.

For crypto, my dream API would give me direct mappings to the basic func: uint64 x uint64 -> uint64 x uint64 functions (mulq, addq, etc for amd64), then have smart pattern matching do the carry/flag/pipeline-aware instruction optimizations when possible (think add/adc, or the newer mulx/adcx/adox). There's an example of this being a huge win in math/big already and Intel wrote a guide for how to apply those instructions to patterns that are common in crypto code.

gtank on 13 Apr 2018

👍2

Change https://golang.org/cl/106376 mentions this issue: cmd/compile: add intrinsics for multiword arithmetic

gopherbot on 13 Apr 2018

Sorry, I jumped the gun and assumed we have a consensus on math/bits as the right package for the APIs and took a stab at it in CL106376. My sincere apologies.

bmakam-qdt on 13 Apr 2018

I strongly disagree about putting these in math/bits: these are not operations on bits. They may involve operations on bits, but ultimately so does everything.

I really do not want to see math/bits turn into a junk drawer.

The proposed operations are a new class of operations and they deserve their own package for ease of discovery, to make reading code that uses these operations clear, and so that they can have focused documentation.

jimmyfrasche on 13 Apr 2018

👍4

I don't see a need for both AddQW and AddQQ. AddQW(x1, x0, y) = AddQQ(x1, x0, 0, y). The compiler should have no problem constant folding the 0. Just have a single Add.
In any case, I don't like Q and W. What do they stand for? (quadword and word? That makes no sense here.)

randall77 on 13 Apr 2018

👍3

The proposed operations are a new class of operations and they deserve their own package

math/medium? :)

josharian on 13 Apr 2018

😄5

I agree with @jimmyfrasche.

I think math/big makes some sense. They are "big" integers, just not arbitrarily big, but still bigger than uint64.

cherrymui on 13 Apr 2018

👎1

Agreed on just having Add (and Mul) and detecting constant 0. Let's not leak the instruction names.

They definitely don't fit in math/big. I see the concern for math/bits becoming math/intrinsicutils, but I am not convinced that any other package name would grow past two functions, or have the same issue. And this is about multiplication of high and low bits, even if it's a stretch.

FiloSottile on 13 Apr 2018

BTW, @bmakam-qdt thanks for doing the prototyping work. It's very valuable even if the proposal is not finalized yet!

FiloSottile on 13 Apr 2018

👍3

@josharian I didn't say coming up with a good name would be easy—just that it's the right thing to do :þ

Can methods be compiler intrinsics? A lot of the naming and documentation could be simplified if it were something like

type Uint struct {
  Hi, Lo uint64
}
func (a Uint) Mul(b Uint) Uint

Maybe then math/math128? The import stutters a bit and the name is slightly unwieldy but it would only be needed for the types and related factories. It could contain math128.Uint, math128.Int, and perhaps even math128.Float.

That's not a great name but hopefully it spurs a better one.

jimmyfrasche on 13 Apr 2018

These can't be in math/big. Personally I think math/bits is OK but I respect the objections. math/math128 is not bad but sounds to me like a place for the 128-bit integer types, which isn't quite what this is. It seems to me that we are going for is basically double word operations. Unfortunately C suggests that double is a floating point value. Perhaps math/wide? wide.Mul?

ianlancetaylor on 13 Apr 2018

👍7

FWIW, I also think they're not a good fit for math/bits since they're about integer semantics rather than the literal bits.

What does everyone think of math/intrinsic with the chance to handle future needs of this type there too?

package intrinsic

type Uint128 struct {
 Hi, Lo uint64
}
func Mul(a, b uint64) Uint128

gtank on 13 Apr 2018

Perhaps math/wide? wide.Mul?

I like math/wide; it does double duty, because (a) the types are wide and (b) it evokes "widening arithmetic".

Is there any danger of wanting 256+ bit arithmetic anytime soon? If so, we might want Mul128. Or maybe Mul64, since the inputs are 64 bit?

cc @kortschak @btracey in case they want to weigh in from a scientific computing perspective. (This has all been crypto-oriented so far.)

josharian on 13 Apr 2018

👍3

I don't like math/intrinsic it exposes implementation details of gc and by that logic we will need to move most of math/bits to math/intrinsic because we already intrinsify most of it.

On an unrelated note what about 128-bit subtraction? Just express it as an add with -x?

TocarIP on 13 Apr 2018

👍1

I'm not sure the Uint128 type would let the kind of code this is for flow well. What is now x.Hi can later be something else and now you have to assign it a new name. And it might not play well with SSA (arrays don't). Let's not overcomplicate it, this is for so-low-level-that-it-was-going-to-be-assembly code.

No on math/intrinsic too, they won't be intrinsified everywhere, and it's a "utils"-like name.

FiloSottile on 13 Apr 2018

👍1

(Another advantage to math/wide is that it makes it obvious where saturating arithmetic operations live, should we decide to add them later: math/saturate or maybe just math/sat.)

josharian on 13 Apr 2018

Looping in @kunde21 and @vladimir-ch because they have spent a bunch of time with our assembly implementations. I believe the short comment is that we would really love a way to easily use SSE / AVX (for float64). This is partly due to the direct speedups from using the types, but also that it would be great for the compiler to know these instructions so inlining can happen within the context of bigger blocks.

btracey on 13 Apr 2018

👍2

The biggest issues with math/wide is someone looking at the package index for the first time and thinking "wait, what's wide and what's big?" but the doc synopsis would cover that and that's good enough for me.

My main point with introducing the types is that it would allow the operations on those types to be namespaced so there wouldn't need to be MulU128, MullI128, MulF128, etc.—they could just be T.Mul(T) for T in {Int, Uint, Float}.

jimmyfrasche on 13 Apr 2018

For naming, maybe math/split (as in, “a uint128 split into two uint64s”)?

But it seems like if this is worthwhile, we should just do #9455 instead. Intrinsics would require compiler support either way, and a built-in type seems much more ergonomic (less risk of accidental transposition, and more similarity to integer operations at other sizes).

bcmills on 14 Apr 2018

Mul128, type Uint128 struct{Hi, Lo uint64}, and such seem like a poor substitute for a proper uint128 type. I mean, if it's being intrinsified the compiler is already recognizing the routine, so I fail to see the difference between that and adding uint128 as a builtin. That could also be my ignorance showing, though.

ericlagergren on 14 Apr 2018

👍1

this is for so-low-level-that-it-was-going-to-be-assembly code.

How much faster is the assembly code? Usually in these packages writing assembly (for larger sequences) tends to expose other optimization opportunities as well.

rsc on 16 Apr 2018

@rsc This improved the performance of crypto/poly1305 by ~30% on arm64 (Amberwing). There is more optimization opportunity if we use neon/simd instructions.

bmakam-qdt on 16 Apr 2018

I could see Add64 (64,64 -> 64,1) and Mul64 (64,64 -> 64,64), but beyond that I don't see much that's generally important. Hard-coding higher-level things like 64,64,64,64 -> 64,64 for 128-bit add seems like going too far.

My question was not how much faster it is to have intrinsics. It was how much additional speed you get from assembly compared to limited intrinsics.

rsc on 16 Apr 2018

Agreed on just having Add64 and Mul64.

how much additional speed you get from assembly compared to limited intrinsics.

Not sure if I understood this correctly, but if we don't intrinsify these intrinsics, the assembly version is 50% faster.

bmakam-qdt on 16 Apr 2018

Based on discussion with @agl and @FiloSottile:

The argument for adding these is not so much for the mainline architectures where we have people to write assembly for larger operations, but instead for the less popular architectures, where intrinsics let us use portable code that executes faster. It sounds like the basic operations are:

func Mul64(x, y uint64) (hi, lo uint64)
func Mul32 not necessary because it's just 64-bit multiply
func Div64(hi, lo, x uint64) (quo, rem uint64)
func Div32 not necessary because it's just 64-bit divide

and

func Add64(x, y, carry uint64) (sum, carry uint64) // carry assumed to be 0 or 1
func Add32
func Add (uint-sized)
func Sub64(x, y, borrow uint64) (difference, borrow uint64) // borrow assumed to be 0 or 1
func Sub32
func Sub (uint-sized)

For shift it seems like pattern-matching is fine: L/R Shift 128/n -> 64 for example is(x<<n) | (y>>(64-n))

It would be nice to have, say, three specific algorithms to use as test cases and make sure that this set is useful. We don't want to add every last extra thing, but we do want to make sure it covers enough to help.

If someone can identify those algorithms, I'm OK with accepting this.

These would go in math/bits.

rsc on 20 Apr 2018

👍2

One good example is x/crypto/ed25519. It's currently a 32-bit implementation on all platforms. I have a 64-bit version with both pure Go and assembly math, and the mul64 emulation overhead makes the Go slower than the existing code. That should not be true, and intrinsic calls should fix it.

gtank on 20 Apr 2018

Also based on discussion with @rsc, these are going in math/bits.

FiloSottile on 20 Apr 2018

😕2 👎2 👍1

x/crypto/poly1305 is another example that covers Mul64, Add64 and L/R Shift 128/n -> 64

bmakam-qdt on 24 Apr 2018

@rsc What is the rationale for not including the uint-sized versions of Mul and Div as well? Those are both internal primitives in math/big.

// A Word represents a single digit of a multi-precision unsigned integer.
type Word uint
func mulWW(x, y Word) (z1, z0 Word)
func divWW(x1, x0, y Word) (q, r Word)

bmkessler on 26 Apr 2018

We did not include the 32-bit ones as they are easily implemented with uint64, but I agree that the uint sized ones would be good to have anyway.

FiloSottile on 26 Apr 2018

@rsc @FiloSottile is this accepted? It sounds so from rsc comment but the tag wasn't set.

Also please notice this CL:
https://go-review.googlesource.com/c/go/+/91755

that introduces runtime.mul(x,y uint64) (uint64, bool), based on discussion in other issue which I cannot find now... /cc @martisch

rasky on 13 May 2018

The issue for runtime.mul is: #21588. The use case there is to avoid costly overflow checks involving divisions or table lookups for runtime arithmetic when allocating e.g. slices/channels/maps ... . I think checking the overflow can be easier and more efficient than carrying out the whole wide mul/add operation and checking the result. Maybe "if x, y = Mul(a, b); x > 0" where a,b are uint can be detected and optimized by the compiler for both 32bit and 64bit.

Both issues likely share some backend/ssa work and could alias some of the same intrinsic and approaches. I did not get to finish https://go-review.googlesource.com/c/go/+/91755 for go1.11 but planning to work on it again until go1.12.

martisch on 13 May 2018

Can't runtime.mul be implemented by pattern matching

hi, lo := bits.Mul64(x, y)
carry := hi != 0

instead?

I think that between runtime, poly1305 and ed25519 we have enough examples
to start prototyping this.

On Sun, May 13, 2018 at 3:22 AM Martin Möhrmann notifications@github.com
wrote:

The issue for runtime.mul is: #21588
https://github.com/golang/go/issues/21588. The use case there is to
avoid costly overflow checks involving devisions or table lookups for
runtime arithmetic when allocating e.g. slices/channels/maps ... . I think
checking the overflow can be easier and more efficient than carrying out
the whole wide mul/add operation and checking the result. Both issues
likely share some backend/ssa work and could alias some of the same
intrinsics and approaches.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/golang/go/issues/24813#issuecomment-388606879, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABKyTqQ30hxugWIjwDQ2ZftSunyCTIRYks5tx98dgaJpZM4TQzez
.

FiloSottile on 14 May 2018

Can't runtime.mul be implemented by pattern matching

It can but it if its for performance it ideally should not actually carry out the wide multiplication (if not done by the architecture anyway for free vs normal multiplication) but should e.g. just check jump on the overflow flag of the normal multiplication and support uint size without specifying 64 or 32 bit mul. Runtime would need to be whitelisted to import bits as currently some sys functions are just aliased to the same intrinsics with bits.

https://go-review.googlesource.com/c/go/+/91755 is kind of a prototype for the runtime multiplication overflow check. I am (and by feedback from others) just not happy that it needs to introduce a new instruction in the ssa backend and would rather it worked by combining the existing mul instruction with a jmp instruction on amd64. This also would lead to optimization rules being able to apply more broadly and not dealing with another Op.

At any rate the above CL is/was independent from any math.bits prototyping that can be started.

martisch on 14 May 2018

Revising my comment from May, the proposal is to add these to math/bits:

// Add with carry
// The carry inputs are assumed to be 0 or 1; otherwise behavior undefined.
// The carryOut outputs are guaranteed to be 0 or 1.
func Add(x, y, carry uint) (sum, carryOut uint)
func Add32(x, y, carry uint32) (sum, carryOut uint32)
func Add64(x, y, carry uint64) (sum, carryOut uint64)

// Subtract with borrow
// The borrow inputs are assumed to be 0 or 1; otherwise behavior undefined.
// The borrowOut outputs are guaranteed to be 0 or 1.
func Sub(x, y, borrow uint) (difference, borrowOut uint)
func Sub32(x, y, borrow uint32) (difference, borrowOut uint32)
func Sub64(x, y, borrow uint64) (difference, borrowOut uint64)

// Full-width multiply: 32x32->64, 64x64->128
func Mul(x, y uint) (hi, lo uint)
func Mul32(x, y uint32) (hi, lo uint32)
func Mul64(x, y uint64) (hi, lo uint64)

// Full-width divide: 64/32 -> 32,32, 128/64 -> 64,64
// Behavior undefined if hi >= x (because quotient will not fit).
func Div(hi, lo, x uint) (quo, rem uint)
func Div32(hi, lo, x uint32) (quo, rem uint32)
func Div64(hi, lo, x uint64) (quo, rem uint64)

Compared to my earlier comment, I've added uint-sized Mul and Div to complete the set. It looks odd not to do so, especially if one is interested in writing efficient code that doesn't assume a particular size of int/uint. We still don't need 8-bit or 16-bit versions of these since all platforms have uint at least 32 bits, so those can be implemented in terms of uint (or uint32) operations.

It's fine to add a pattern match in the compiler to allow Mul32 to be implemented as

func Mul32(x, y uint32) (hi, lo uint32) {
    z := uint64(x)*uint64(y)
    return uint32(z>>32), uint32(z)
}

but Mul32 should still be included, for the same reasons we included bits.RotateLeft: having the API allows for clearer code and avoids bugs caused by the magic sequence being written over and over. (My first attempt at writing that code typoed to z>>3, for example.)

For now we will leave out shift, because there's not one agreed-upon meaning for a double-precision shift and it's easy to write clear, correct, efficient code for the meaning needed in a given context. (In contrast, it is tricky or tedious or both to write add/sub/mul/div correctly.)

Dropping Proposal-Crypto label, because this should be considered at the main proposal review meeting. I think this has come up often enough and the meaning well-understood enough that we could decide to add these for Go 1.12 without needing crypto benchmark results. It also matters for #25819.

rsc on 18 Jun 2018

❤3

@rsc I think we need also a signed version for mul and div, as those cannot be easily implemented using the unsigned version.

Also what about division by zero? Would that panic?

rasky on 18 Jun 2018

@rasky Do you have a use case for the signed versions?

FiloSottile on 18 Jun 2018

Any case where signed multiplication is used in computer science could benefit from having a stable low/hi API. I sometimes work with geometry and signal processing, for instance, and on ARMs you don't always have a FPU so using fixed points is an option. With a 32-bit fixed point, you need a 32x32->64 multiplication.

rasky on 18 Jun 2018

Proposal accepted per @rsc's comment https://github.com/golang/go/issues/24813#issuecomment-398096235 -- iant for @golang/proposal-review

ianlancetaylor on 18 Jun 2018

Porting musl's fma implementation to Go required Mul64 as well. Here is a possible full-width multiply implementation.
Edit: Link to musl libc

// Mul returns the high and low bits from the product of x and y.
func Mul(x, y uint) (hi, lo uint) {
    if UintSize == 32 {
        hi32, lo32 := Mul32(uint32(x), uint32(y))
        return uint(hi32), uint(lo32)
    }
    hi64, lo64 := Mul64(uint64(x), uint64(y))
    return uint(hi64), uint(lo64)
}

// Mul32 returns the high and low 32 bits from the 64-bit product of x and y.
func Mul32(x, y uint32) (hi, lo uint32) {
    z := uint64(x) * uint64(y)
    return uint32(z >> 32), uint32(z)
}

// Mul64 returns the high and low 64 bits from the 128-bit product of x and y.
func Mul64(x, y uint64) (hi, lo uint64) {
    xlo := uint64(uint32(x))
    xhi := x >> 32
    ylo := uint64(uint32(y))
    yhi := y >> 32

    t1 := xlo * ylo
    t2 := xlo*yhi + xhi*ylo
    t3 := xhi * yhi
    lo = t1 + (t2 << 32)
    hi = t3 + (t2 >> 32)
    if t1 > lo {
        hi += 1
    }
    return hi, lo
}

smasher164 on 8 Jul 2018

I have a prototype of the extended precision operations over at github.com/smasher164/extprec. All the implementations are adapted from "Hacker's Delight", and the relevant references are provided.

However, it still needs real test cases in order to hammer out any possible bugs. I've only tested it with random hand-picked values.

Edit: Completed unit tests!

smasher164 on 9 Jul 2018

Change https://golang.org/cl/123157 mentions this issue: math/bits: add extended precision Add, Sub, Mul, Div

gopherbot on 10 Jul 2018

The above CL ports the go versions of these functions from math/big since those are previously tested. Additional SSA rules will still need to be implemented as this just handles the simple Mul and Div cases for amd64 and arm64 that are used in math/big.

bmkessler on 11 Jul 2018

Really tiny nitpick. Wouldn't it make more sense to have func Div(hi, lo, y uint), instead of func Div(hi, lo, x uint)? Because in all other functions, x is the first operand and y the second one. hi and lo are 128-bits of the first operand so maybe an y would be more descriptive.

TuomLarsen on 3 Aug 2018

Change https://golang.org/cl/129415 mentions this issue: cmd/compile: intrinsify math/bits.Mul and Div

gopherbot on 15 Aug 2018

I was curious about this, so I cherry-picked the relevant CLs on top of current Go master and have some preliminary results from my Ed25519 field arithmetic. bits.Mul64 got me about halfway to the speed of an assembly implementation without much effort. I suspect I could make the pure Go a bit faster as well, but haven't tried yet.

Rough benchmarks on a Kaby Lake laptop (i7-7560U @ 2.40GHz):

$ benchstat old32 bits64
name             old time/op  new time/op  delta
KeyGeneration-4  48.5µs ± 0%  33.9µs ± 2%  -30.22%  (p=0.004 n=5+6)
Signing-4        49.6µs ± 1%  34.8µs ± 2%  -29.84%  (p=0.002 n=6+6)
Verification-4    133µs ± 1%    93µs ± 3%  -29.81%  (p=0.002 n=6+6)
$ benchstat bits64 amd64
name             old time/op  new time/op  delta
KeyGeneration-4  33.9µs ± 2%  24.9µs ± 0%  -26.36%  (p=0.001 n=6+7)
Signing-4        34.8µs ± 2%  25.9µs ± 0%  -25.70%  (p=0.001 n=6+7)
Verification-4   93.2µs ± 3%  60.5µs ± 0%  -35.06%  (p=0.002 n=6+6)
$ benchstat old32 amd64
name             old time/op  new time/op  delta
KeyGeneration-4  48.5µs ± 0%  24.9µs ± 0%  -48.61%  (p=0.003 n=5+7)
Signing-4        49.6µs ± 1%  25.9µs ± 0%  -47.87%  (p=0.001 n=6+7)
Verification-4    133µs ± 1%    61µs ± 0%  -54.42%  (p=0.002 n=6+6)

If anyone else is interested in playing with this, it's this code and this Go.

gtank on 3 Sep 2018

🎉4

@gtank your mul64 function (which is actually a mad64 AFAICT) should use bits.Add to implement double-word add:

func mul64(accLo, accHi, x, y uint64) (ol, oh uint64) {
    oh, ol = bits.Mul64(x, y)
        var carry uint64
        ol, carry = bits.Add64(ol, accLo, 0)
        oh, _ = bits.Add64(oh, accHi, carry)
    return
 }

at some point that should get turned into ADD+ADC, which would probably get you another boost compared to assembly.

rasky on 15 Sep 2018

Change https://golang.org/cl/138917 mentions this issue: cmd/compile: instrinsify math/bits.Mul on ppc64x

gopherbot on 2 Oct 2018

Should this issue be closed now that CL 123157 has been merged in?

smasher164 on 17 Oct 2018

Maybe? @randall77, are there plans to intrinsify more of this?

bradfitz on 17 Oct 2018

Only Mul has been intrinsified so far.
I think we want Add/Sub intrinsified also (or better, recognize the Go code and generate the add-with-carry instruction).
That said, those tasks don't need to live in this issue, which is more about API. If concerned people can open issues for their architecture, we can close this issue.

randall77 on 18 Oct 2018

Okay, closing.

bradfitz on 18 Oct 2018

Change https://golang.org/cl/157717 mentions this issue: strconv: simply (*extFloat).Multiply using math/bits.Mul64

gopherbot on 13 Jan 2019

Go: math/bits: add extended precision Add, Sub, Mul, Div

Most helpful comment

All 74 comments

Related issues