Go: cmd/compile: inefficient code generation on amd64

Created on 27 Jul 2020  路  5Comments  路  Source: golang/go

Probably, inefficient code generation on amd64.

Probably a slight regression in the performance of the resulting code beetween g.14.6 and master (8696ae82c9).

Preface

Examining the assembly code listing, I found out that the compiler generates sub-optimal code for some part of the code.
I will not describe the whole journey. I found a code (code from benchmark game) illustrating the problem and will use it for an example.

All source code: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/nbody-go-3.html

Let's pay attention to these lines (60, 61, 62), it's "advance" func:

dx := sys.s[i].x - sys.s[j].x
dy := sys.s[i].y - sys.s[j].y
dz := sys.s[i].z - sys.s[j].z

Let's take a look at the result of the code generation:

  main.go:60    LEAQ 0(CX)(CX*2), DI         // DI = i, BX = j
  main.go:60    LEAQ main.sys+160(SB), R8

  main.go:60    MOVSD_XMM 0(R8)(DI*8), X4    // load x +0x0
  main.go:60    LEAQ 0(BX)(BX*2), R9         // calculate offset for .x
  main.go:60    LEAQ 0(R8)(R9*8), R10        // sys.s[j] + offset
  main.go:60    SUBSD 0(R10), X4

  main.go:61    MOVSD_XMM 0x8(R8)(DI*8), X5  // load y +0x8
  main.go:61    LEAQ 0(R8)(R9*8), R10
  main.go:61    LEAQ 0x8(R10), R10
  main.go:61    SUBSD 0(R10), X5

  main.go:62    MOVSD_XMM 0x10(R8)(DI*8), X6 // load z +0x10
  main.go:62    LEAQ 0(R8)(R9*8), DI
  main.go:62    LEAQ 0x10(DI), DI
  main.go:62    SUBSD 0(DI), X6

We need one instruction to load the value of the left operand and two to calculate the address of the right one.

I rolled back to version go1.14 and found that the codegen there is different. For the same lines, the following code will be generated:

  main.go:60    0x49f77d    LEAQ 0(CX)(CX*2), DI              // DI = 0
  main.go:60    0x49f781    LEAQ main.sys+160(SB), R8         // R8 = Positions addr

  main.go:60    0x49f788    MOVSD_XMM 0(R8)(DI*8), X4         // load sys.s[i].x
  main.go:60    0x49f78e    LEAQ 0(BX)(BX*2), R9
  main.go:60    0x49f792    MOVSD_XMM 0(R8)(R9*8), X5         // load sys.s[j].x
  main.go:60    0x49f798    SUBSD X5, X4

  main.go:61    0x49f79c    MOVSD_XMM 0x8(R8)(DI*8), X5
  main.go:61    0x49f7a3    MOVSD_XMM 0x8(R8)(R9*8), X6       // dy := sys.s[i].y - sys.s[j].y
  main.go:61    0x49f7aa    SUBSD X6, X5

  main.go:62    0x49f7ae    MOVSD_XMM 0x10(R8)(DI*8), X6
  main.go:62    0x49f7b5    MOVSD_XMM 0x10(R8)(R9*8), X7      // dz := sys.s[i].z - sys.s[j].z
  main.go:62    0x49f7bc    SUBSD X7, X6

It uses one instruction to load both the left and right side of the expression.

How does it affect code performance?

It depends on how often such a code generation pattern is used in the code and in how hot the pieces of code are.
But we can write benchmark:

package main

import (
    "testing"
)

func BenchmarkT(b *testing.B) {
    offsetMomentum()
    c := 500000

    b.ResetTimer()
    for n := 0; n < b.N; n++ {
        for i := 0; i < c; i++ {
            advance(0.01)
        }
    }

    _ = energy()
}

Results:

$ bin for i in {1..20}; do go.1.14 test -bench=. >> old.txt; done
$ bin for i in {1..20}; do go.master test -bench=. >> new.txt; done
$ bin benchstat old.txt new.txt
name  old time/op  new time/op  delta
T-4   48.9ms 卤 5%  54.5ms 卤10%  +11.46%  (p=0.000 n=19+19)

When did the issue start?

git bisect report beetween master(8696ae82c9) and go1.14:

```shell script
... skip all steps

98cb76799c3053e779c4e1b61bb50705d25dd77f is the first bad commit
commit 98cb76799c3053e779c4e1b61bb50705d25dd77f
Author: Keith Randall khr@golang.org
Date: Thu Jan 30 10:17:01 2020 -0800

cmd/compile: insert complicated x86 addressing modes as a separate pass

Use a separate compiler pass to introduce complicated x86 addressing
modes.  Loads in the normal architecture rules (for x86 and all other
platforms) can have constant offsets (AuxInt values) and symbols (Aux
values), but no more.

The complex addressing modes (x+y, x+2*y, etc.) are introduced in a
separate pass that combines loads with LEAQx ops.

Organizing rewrites this way simplifies the number of rewrites
required, as there are lots of different rule orderings that have to
be specified to ensure these complex addressing modes are always found
if they are possible.

Update #36468

Change-Id: I5b4bf7b03a1e731d6dfeb9ef19b376175f3b4b44
Reviewed-on: https://go-review.googlesource.com/c/go/+/217097
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Josh Bleecher Snyder <[email protected]>

src/cmd/compile/internal/ssa/addressingmodes.go | 154 +
src/cmd/compile/internal/ssa/compile.go | 1 +
src/cmd/compile/internal/ssa/gen/AMD64.rules | 699 +-
src/cmd/compile/internal/ssa/rewrite.go | 40 +
src/cmd/compile/internal/ssa/rewriteAMD64.go | 10427 ++++++----------------
test/codegen/memops.go | 88 +
6 files changed, 3233 insertions(+), 8176 deletions(-)
create mode 100644 src/cmd/compile/internal/ssa/addressingmodes.go
```

Issue linked with changes - https://github.com/golang/go/issues/36468

/cc @randall77

NeedsInvestigation Performance

Most helpful comment

Yes, that CL improve stats:

$ benchstat regression.txt fix.txt 
name  old time/op  new time/op  delta
T-4   54.5ms 卤10%  47.7ms 卤 0%  -12.44%  (p=0.000 n=19+20)`

And compare with go1.14:

$ benchstat go.14.txt fix.txt 
name  old time/op  new time/op  delta
T-4   48.9ms 卤 5%  47.7ms 卤 0%  -2.41%  (p=0.000 n=19+20)

All 5 comments

/cc @randall77

Yes, looks like we need corresponding float support. I have a CL.

Change https://golang.org/cl/244859 mentions this issue: cmd/compile: add floating point load+op operations to addressing modes pass

With that CL you should get better code than even 1.14:

        0x000d 00013 (/Users/khr/gowork/issue40426.go:69)       LEAQ    (CX)(CX*2), DI
        0x0011 00017 (/Users/khr/gowork/issue40426.go:69)       LEAQ    "".sys+160(SB), R8
        0x0018 00024 (/Users/khr/gowork/issue40426.go:69)       MOVSD   (R8)(DI*8), X4
        0x001e 00030 (/Users/khr/gowork/issue40426.go:69)       LEAQ    (BX)(BX*2), R9
        0x0022 00034 (/Users/khr/gowork/issue40426.go:69)       SUBSD   (R8)(R9*8), X4
        0x0028 00040 (/Users/khr/gowork/issue40426.go:70)       MOVSD   8(R8)(DI*8), X5
        0x002f 00047 (/Users/khr/gowork/issue40426.go:70)       SUBSD   8(R8)(R9*8), X5
        0x0036 00054 (/Users/khr/gowork/issue40426.go:71)       MOVSD   16(R8)(DI*8), X6
        0x003d 00061 (/Users/khr/gowork/issue40426.go:71)       SUBSD   16(R8)(R9*8), X6

Not sure if it would be faster. Please try it and let me know.

Yes, that CL improve stats:

$ benchstat regression.txt fix.txt 
name  old time/op  new time/op  delta
T-4   54.5ms 卤10%  47.7ms 卤 0%  -12.44%  (p=0.000 n=19+20)`

And compare with go1.14:

$ benchstat go.14.txt fix.txt 
name  old time/op  new time/op  delta
T-4   48.9ms 卤 5%  47.7ms 卤 0%  -2.41%  (p=0.000 n=19+20)
Was this page helpful?
0 / 5 - 0 ratings