Go: cmd/compile: performance problems with many long, exported identifiers

Created on 11 Jan 2017 · 22 comments · Source: golang/go

What version of Go are you using (go version)?

1.7, 1.8+

What operating system and processor architecture are you using (go env)?

win64 and nix64

What did you do?

Tried to compile 150 MB of code in a single package. Really it was a simple main.go that referenced this huge package; I executed go build, but it was the compile executable that consumed ridiculous resources. This even happens in my case when I get the code down to half that size (~75 MB). Disabling inlining and/or optimizations only delays the extreme resource usage; it does not eliminate it.

What did you expect to see?

A successful build with modest RAM usage that does not climb to extreme proportions, ideally one that "streams" the work so it is limited only by the disk space needed for the result.

What did you see instead?

Memory consumption proportional to the code size, ending in an out-of-memory failure (Windows) or swapping forever until receiving a kill signal (*nix).

I am afraid I cannot provide the large set of code at the immediate moment. But if necessary, I can build a Go generator that generates a ton of interfaces, structs, and functions. Original thread: https://groups.google.com/forum/#!topic/golang-nuts/sBBkQ1_xf2Q.

NeedsInvestigation ToolSpeed

All 22 comments

Without a specific repro, I don't foresee many people jumping on this bug.

But once you have a repro, it becomes much more exciting to debug and fix.

@bradfitz I was hoping the bug was obvious ("large code can't be compiled") but I understand the need to reproduce. I will take some time to write an extreme Go code generator to replicate my issue (or maybe post a link to a large set of code causing this issue).

@cretz, no, it's not obvious at all. It's not even obvious whether you're hitting problems in the compiler (e.g. the SSA optimization phases, and there are many), or in the linker. Every time a bug like this has come up, it's had a very unique cause and fix.

@bradfitz - Code to reproduce: repro-18602.zip

I was able to get ~130 MB of the code to compress down to 10 MB so I could attach it to this issue; see the link above. It comes from a transpiler I am working on here. Simply extract the ZIP, navigate into the extracted folder, and run go build to see the issue. For me on Windows:

    # _/C_/work/temp/repro/rt
    runtime: VirtualAlloc of 1048576 bytes failed with errno=1455
    fatal error: runtime: cannot map pages in arena address space

    runtime stack:
    runtime.throw(0x921776, 0x30)
            c:/go/src/runtime/panic.go:566 +0x9c
    runtime.sysMap(0xc159810000, 0x100000, 0x7fc01, 0xb7eb38)
            c:/go/src/runtime/mem_windows.go:116 +0x12c
    runtime.(*mheap).sysAlloc(0xb5cbc0, 0x100000, 0xb5cbd8)
            c:/go/src/runtime/malloc.go:407 +0x381
    runtime.(*mheap).grow(0xb5cbc0, 0x1, 0x0)
            c:/go/src/runtime/mheap.go:726 +0x69
    runtime.(*mheap).allocSpanLocked(0xb5cbc0, 0x1, 0xb4afc0)
            c:/go/src/runtime/mheap.go:630 +0x4f9
    runtime.(*mheap).alloc_m(0xb5cbc0, 0x1, 0xa, 0xa65c908)
            c:/go/src/runtime/mheap.go:515 +0xee
    runtime.(*mheap).alloc.func1()
            c:/go/src/runtime/mheap.go:579 +0x52
    runtime.systemstack(0x7fdb0)
            c:/go/src/runtime/asm_amd64.s:314 +0xb5
    runtime.(*mheap).alloc(0xb5cbc0, 0x1, 0x1000000000a, 0x0)
            c:/go/src/runtime/mheap.go:580 +0x7a
    runtime.(*mcentral).grow(0xb5e2f0, 0x0)
            c:/go/src/runtime/mcentral.go:210 +0x9b
    runtime.(*mcentral).cacheSpan(0xb5e2f0, 0xa65c908)
            c:/go/src/runtime/mcentral.go:91 +0x101
    runtime.(*mcache).refill(0x1a0000, 0xc10000000a, 0xa65c908)
            c:/go/src/runtime/mcache.go:121 +0xbc
    runtime.(*mcache).nextFree.func1()
            c:/go/src/runtime/malloc.go:505 +0x3a
    runtime.systemstack(0xb4b000)
            c:/go/src/runtime/asm_amd64.s:298 +0x7e
    runtime.mstart()
            c:/go/src/runtime/proc.go:1079

    goroutine 1 [running]:
    runtime.systemstack_switch()
            c:/go/src/runtime/asm_amd64.s:252 fp=0xc105a60690 sp=0xc105a60688
    runtime.(*mcache).nextFree(0x1a0000, 0xa, 0x0, 0xd, 0xd)
            c:/go/src/runtime/malloc.go:506 +0xb9 fp=0xc105a606e8 sp=0xc105a60690
    runtime.mallocgc(0x90, 0x900300, 0x1, 0xc1597f8bd0)
            c:/go/src/runtime/malloc.go:658 +0x848 fp=0xc105a60788 sp=0xc105a606e8
    runtime.newobject(0x900300, 0xd)
            c:/go/src/runtime/malloc.go:785 +0x3f fp=0xc105a607b8 sp=0xc105a60788
    cmd/compile/internal/gc.Nod(0xc105a60701, 0x0, 0x0, 0x6c6e692e00000001)
            c:/go/src/cmd/compile/internal/gc/subr.go:335 +0x38 fp=0xc105a607e0 sp=0xc105a607b8
    cmd/compile/internal/gc.newname(0xc1597f8bd0, 0x7)
            c:/go/src/cmd/compile/internal/gc/dcl.go:334 +0x51 fp=0xc105a60820 sp=0xc105a607e0
    cmd/compile/internal/gc.newlabel_inl(0xc105a609d8)
            c:/go/src/cmd/compile/internal/gc/inl.go:883 +0x68 fp=0xc105a60850 sp=0xc105a60820
    cmd/compile/internal/gc.mkinlcall1(0xc07db27320, 0xc07da78630, 0x0, 0x0)
            c:/go/src/cmd/compile/internal/gc/inl.go:769 +0xddc fp=0xc105a60d98 sp=0xc105a60850
    cmd/compile/internal/gc.mkinlcall(0xc07db27320, 0xc07da78630, 0x0, 0x0)
            c:/go/src/cmd/compile/internal/gc/inl.go:500 +0x7f fp=0xc105a60dd0 sp=0xc105a60d98
    cmd/compile/internal/gc.inlnode(0xc07db27320, 0x128fdf)
            c:/go/src/cmd/compile/internal/gc/inl.go:481 +0x948 fp=0xc105a60fe8 sp=0xc105a60dd0
    cmd/compile/internal/gc.inlnode(0xc07db273b0, 0x128fdf)
            c:/go/src/cmd/compile/internal/gc/inl.go:365 +0x11e fp=0xc105a61200 sp=0xc105a60fe8
    cmd/compile/internal/gc.inlnode(0xc07db274d0, 0xc10d8ab440)
            c:/go/src/cmd/compile/internal/gc/inl.go:365 +0x11e fp=0xc105a61418 sp=0xc105a61200
    cmd/compile/internal/gc.inlnodelist(0xc07db28180)
            c:/go/src/cmd/compile/internal/gc/inl.go:318 +0x69 fp=0xc105a61458 sp=0xc105a61418
    cmd/compile/internal/gc.inlnode(0xc07db27560, 0xc07db26cf0)
            c:/go/src/cmd/compile/internal/gc/inl.go:379 +0x1c6 fp=0xc105a61670 sp=0xc105a61458
    cmd/compile/internal/gc.inlnodelist(0xc07db29a80)
            c:/go/src/cmd/compile/internal/gc/inl.go:318 +0x69 fp=0xc105a616b0 sp=0xc105a61670
    cmd/compile/internal/gc.inlnode(0xc07daf4ab0, 0x0)
            c:/go/src/cmd/compile/internal/gc/inl.go:436 +0x353 fp=0xc105a618c8 sp=0xc105a616b0
    cmd/compile/internal/gc.inlcalls(0xc07daf4ab0)
            c:/go/src/cmd/compile/internal/gc/inl.go:274 +0x57 fp=0xc105a61908 sp=0xc105a618c8
    cmd/compile/internal/gc.Main.func1(0xc14134f008, 0x6, 0xff, 0xc105a61900)
            c:/go/src/cmd/compile/internal/gc/main.go:435 +0x71 fp=0xc105a61938 sp=0xc105a61908
    cmd/compile/internal/gc.(*bottomUpVisitor).visit(0xc105a61b88, 0xc07daf4ab0, 0x393fe)
            c:/go/src/cmd/compile/internal/gc/esc.go:106 +0x2b5 fp=0xc105a619d8 sp=0xc105a61938
    cmd/compile/internal/gc.(*bottomUpVisitor).visitcode(0xc105a61b88, 0xc07daf25a0, 0x393fe, 0x393fe)
            c:/go/src/cmd/compile/internal/gc/esc.go:137 +0x175 fp=0xc105a61a08 sp=0xc105a619d8
    cmd/compile/internal/gc.(*bottomUpVisitor).visitcode(0xc105a61b88, 0xc07daf2630, 0xc1000393fe, 0x393fe)
            c:/go/src/cmd/compile/internal/gc/esc.go:126 +0x9c fp=0xc105a61a38 sp=0xc105a61a08
    cmd/compile/internal/gc.(*bottomUpVisitor).visitcode(0xc105a61b88, 0xc07daf26c0, 0x393fe, 0x393fe)
            c:/go/src/cmd/compile/internal/gc/esc.go:126 +0x9c fp=0xc105a61a68 sp=0xc105a61a38
    cmd/compile/internal/gc.(*bottomUpVisitor).visitcodelist(0xc105a61b88, 0xc07dacde80, 0xc1000393fe, 0xc105a61b0c)
            c:/go/src/cmd/compile/internal/gc/esc.go:114 +0x6d fp=0xc105a61ab0 sp=0xc105a61a68
    cmd/compile/internal/gc.(*bottomUpVisitor).visit(0xc105a61b88, 0xc07daec480, 0xc1000393f0)
            c:/go/src/cmd/compile/internal/gc/esc.go:80 +0x143 fp=0xc105a61b50 sp=0xc105a61ab0
    cmd/compile/internal/gc.visitBottomUp(0xc12ed46000, 0x432b4, 0x4c000, 0x9625c8)
            c:/go/src/cmd/compile/internal/gc/esc.go:62 +0xdd fp=0xc105a61bc8 sp=0xc105a61b50
    cmd/compile/internal/gc.Main()
            c:/go/src/cmd/compile/internal/gc/main.go:438 +0x1c50 fp=0xc105a61e10 sp=0xc105a61bc8
    cmd/compile/internal/amd64.Main()
            c:/go/src/cmd/compile/internal/amd64/galign.go:93 +0x301 fp=0xc105a61e48 sp=0xc105a61e10
    main.main()
            c:/go/src/cmd/compile/main.go:33 +0x2aa fp=0xc105a61ed8 sp=0xc105a61e48
    runtime.main()
            c:/go/src/runtime/proc.go:183 +0x1f0 fp=0xc105a61f30 sp=0xc105a61ed8
    runtime.goexit()
            c:/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc105a61f38 sp=0xc105a61f30

I was afraid it was a generic, obvious issue due to the extreme size of the code and the fact that it's all in one package. I don't expect the code to run (I figure it would panic; I never got it to compile). But if it's a more nuanced issue, that would be fantastic.

CC @josharian @randall77

I'll leave a compile running overnight to see whether I can get some useful pprof output.

OnUnmappableCharacter__desc____obj__Java__nio__charset__CodingErrorAction__ret____obj__Java__nio__charset__CharsetEncoder -- wow.

Building Kubernetes produces several very large programs, so there could be a reproducer there: https://github.com/kubernetes/kubernetes.git

Set up your GOPATH so it gets this:

    go get -u github.com/jteeuwen/go-bindata/go-bindata

Then add its bin directory to your PATH. Depending on your platform, cd into kubernetes and run make.

Just a suggestion....

Compile with memory profiling is complete. It took almost 7 hours and 13 GB of RAM:

    24205.02 real     28947.74 user      1399.21 sys
12711182336  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
  95223080  page reclaims
        89  page faults
         0  swaps
       111  block input operations
        44  block output operations
         0  messages sent
         0  messages received
         3  signals received
   2251562  voluntary context switches
  53741995  involuntary context switches

Longest phases were:

  • parse: 11m
  • compilefuncs: 15m
  • dumpobj: 6h30m!

The resulting object file was 1.14 GB!

Memory profile, using -alloc_space:

o.pdf

The only unusual things I see in the profile are:

  • ssa.fuseBlockPlain: Never seen this be significant in a memory profile before. Probably a result of unusual machine-generated code patterns. Once my second compile (cpu profile) is complete, I'll check for obvious optimizations here. In any case, the panics in the original report indicate that compilation never even made it to the backend, where this function is.

  • Lots of allocation rooted at methodsym, methodfunc, and typefmt for TFUNC. I think this is a direct result of the extraordinarily long names in the code, particularly in interfaces. Note that a long name used in an interface gets used in many places: in the interface representation, as part of the method set of any type satisfying that interface, as part of the name of many functions, etc. (see the sketch below).
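
As a purely hypothetical illustration of that fan-out (the type and method names below are invented, not taken from the repro): one long interface method name recurs in the interface type, in every implementing type's method set, and in the reflect metadata and generated wrapper symbols for each of those types.

    package main

    // Codec carries one method with a long machine-generated name, in the
    // spirit of the transpiler output (shortened here for readability).
    type Codec interface {
        OnUnmappableCharacter__desc__CodingErrorAction__ret__CharsetEncoder() int
    }

    // Every type satisfying Codec repeats the long name in its method set;
    // the compiler also records the name in reflect metadata and in
    // generated wrapper symbols for each such type.
    type utf8Codec struct{}

    func (utf8Codec) OnUnmappableCharacter__desc__CodingErrorAction__ret__CharsetEncoder() int {
        return 0
    }

    type latin1Codec struct{}

    func (latin1Codec) OnUnmappableCharacter__desc__CodingErrorAction__ret__CharsetEncoder() int {
        return 1
    }

    func main() {
        var c Codec = utf8Codec{}
        _ = c.OnUnmappableCharacter__desc__CodingErrorAction__ret__CharsetEncoder()
    }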

As for the extremely long dumpobj time and giant object file (which will lead to massive link times and/or giant binaries), I suspect that that is again the result of very many very long exported identifiers. Hopefully when my cpu profile run completes in five hours, it'll be much clearer whether there are obvious non-linearities in dumpobj.

@cretz though there may be some improvements available in the compiler, I don't anticipate anything breathtaking; you will almost certainly also need to change your generated code as well. As a start, I suggest altering your code generator to (a) avoid exporting any identifiers you reasonably can and (b) give your symbols shorter names, perhaps using truncated hashes or a global numbering scheme. (You'll probably also then need to generate a lookup table or some such to help with demangling, for debugging. Sorry about that.) If you do those things, please report back with the results, and maybe with an updated code dump. Thanks!
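
For (b), here is a minimal sketch of the truncated-hash idea, assuming nothing about the actual generator (all names and the hashing scheme are invented for illustration):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // demangle maps short generated names back to the original long ones,
    // for debugging. A real generator would write this table out as a
    // separate lookup file rather than keep it in memory.
    var demangle = map[string]string{}

    // shortName derives a compact exported identifier from a long one via
    // a truncated hash. Collisions are unlikely at 12 hex digits, but a
    // real generator should detect them and lengthen the hash if needed.
    func shortName(long string) string {
        sum := sha256.Sum256([]byte(long))
        short := "F_" + hex.EncodeToString(sum[:6])
        demangle[short] = long
        return short
    }

    func main() {
        long := "OnUnmappableCharacter__desc____obj__Java__nio__charset__CodingErrorAction__ret__CharsetEncoder"
        fmt.Println(shortName(long)) // e.g. F_8c2a41d97b03
    }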

@laboger thanks, but to mangle Tolstoy, every interminable compile is interminable in its own way.

@josharian - Thanks for the update. I figured my use case was just really extreme. I can unexport some things. I can also reduce the identifier size (I wanted my identifiers to help w/ runtime reflection, but no big deal, I can maintain a separate blob w/ that info). I was hoping that even though the obj file is big, that DCE would remove a lot of bloat from the final binary due to having most functions as methods, but I am unfamiliar w/ the internals of the linker.

"whether there are obvious non-linearities in dumpobj"

I think this might be a key point in general. Ideally some of the work can be streamed and not held in memory for the life of the package compile.
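
A generic sketch of that streaming idea (purely illustrative; it says nothing about how cmd/compile is actually structured): emit each compiled unit as soon as it is finished, so peak memory is bounded by one unit rather than by the whole package.

    package main

    import (
        "bufio"
        "log"
        "os"
    )

    // emit writes one compiled unit to the output immediately instead of
    // accumulating all units in memory until the end of the compile.
    func emit(w *bufio.Writer, name string, data []byte) error {
        if _, err := w.WriteString(name + "\n"); err != nil {
            return err
        }
        _, err := w.Write(data)
        return err
    }

    func main() {
        f, err := os.Create("pkg.obj")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        w := bufio.NewWriter(f)
        defer w.Flush()
        for _, fn := range []string{"a", "b", "c"} {
            if err := emit(w, fn, []byte("...code...\n")); err != nil {
                log.Fatal(err)
            }
        }
    }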

At least now y'all have a good stress test case.

If it's really very long identifiers that cause problems with the compile time, we should try to get to the bottom of this, rather than find work-arounds. Exporting long identifiers shouldn't cause undue compilation times - there's a bug somewhere.

As one data point: The export data writer uses a map to canonicalize strings - that is, a string that's been seen before (this includes exported identifiers) will only appear once in the export data. But the same identifiers may appear elsewhere in the obj file.
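
For illustration only (this is not the actual export-data code), that canonicalization amounts to string interning: the first occurrence of a string is stored, and every repeat becomes a small index, so a long identifier appears only once in the output.

    package main

    import "fmt"

    // stringTable interns strings: each distinct string is stored once,
    // and later occurrences map to the same small index.
    type stringTable struct {
        index map[string]int
        strs  []string
    }

    func newStringTable() *stringTable {
        return &stringTable{index: map[string]int{}}
    }

    func (t *stringTable) intern(s string) int {
        if i, ok := t.index[s]; ok {
            return i // seen before: reuse its index
        }
        i := len(t.strs)
        t.strs = append(t.strs, s)
        t.index[s] = i
        return i
    }

    func main() {
        t := newStringTable()
        fmt.Println(t.intern("VeryLongExportedName")) // 0
        fmt.Println(t.intern("Other"))                // 1
        fmt.Println(t.intern("VeryLongExportedName")) // 0 again: stored once
    }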

Here's a CPU profile:

cf.pdf

Aha!

Hello, gc.testdclstack. This is not the first time we've had problems with testdclstack. See #14781. Robert suggested only enabling it in a special debug mode in the compiler. It is probably time to do that, perhaps even for Go 1.8. I'll see about sending a CL soon.
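
In sketch form, the suggested change amounts to gating the expensive check behind an opt-in debug flag (the names here are illustrative; the actual change is in the CLs mentioned below):

    package main

    // debugDclstack would be set by an opt-in compiler debug flag;
    // the flag name here is hypothetical.
    var debugDclstack = false

    // verifyDclstack stands in for the expensive consistency check that
    // previously ran unconditionally during parsing.
    func verifyDclstack() {
        // ... O(n) invariant checking ...
    }

    func checkDclstack() {
        if !debugDclstack {
            return // skip the check in normal builds
        }
        verifyDclstack()
    }

    func main() {
        checkDclstack()
    }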

With gc.testdclstack eliminated, the parse phase drops from 11m to 13s. Still waiting to see how much it helps elsewhere.

Eliminating gc.testdclstack won't help with memory usage, though. My compile is still at 7gb and growing. ssa.fuseBlockPlain changes may help, experimentation still to come.

If it's really very long identifiers that cause problems with the compile time, we should try to get to the bottom of this, rather than find work-arounds.

I don't think it's just very long exported identifiers. It is also the sheer number of them, and probably also some structural things (per the other comments I've made). Squinting at the profiles, the long identifiers are maybe 10% of the memory issue; I just suggested addressing them to @cretz as a good, straightforward first step (and an experiment to confirm what I'm seeing).

@josharian Excellent, thanks for tracking this down. We should perhaps disable testdclstack (in non-debug mode) even for 1.8. It's just a verification step.

@griesemer, @josharian, send a CL soon if so.

CL https://golang.org/cl/35113 mentions this issue.

CL https://golang.org/cl/35114 mentions this issue.

For anyone wanting it, here is a test case w/ smaller identifiers: repro-18602-smaller-idents.zip. Granted, I still think repro-18602.zip is a quality stress test case too.

After the CLs above, compile time is down to half an hour, and max RSS is down a bit:

     1461.67 real      1588.13 user       374.84 sys
11645718528  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
  38278408  page reclaims
        75  page faults
         0  swaps
       312  block input operations
       193  block output operations
         0  messages sent
         0  messages received
         3  signals received
    167410  voluntary context switches
  11764445  involuntary context switches

  • parse now takes 11s.
  • dumpobj now takes 7m.

For reference, here's an alloc_space profile output:

alloc_space_after_fuse.pdf

Aside from the things I've already mentioned, disabling inlining would probably help noticeably now with memory usage.
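
For anyone who wants to try that experiment, inlining can be disabled via the compiler's standard -l flag, e.g.:

    go build -gcflags='-l'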

There might be more optimizations available to further speed up dumpobj or shrink the object file size by reusing more strings somewhere, but I'll leave that to @griesemer (export info) and @crawshaw (reflect info).

Thanks for the new test case; I'll take a look at that later or tomorrow.

Was testdclstack involved in dumpobj? Or what's the reason for the 55x reduction there?

Was testdclstack involved in dumpobj?

Yes.

At least on my machine, the new code you posted compiles ~10-15% faster, but memory usage doesn't shrink significantly; I guess I was wrong. The object file is still 856 MB, though, so you're still probably going to have slow linking and a very large (and probably slow) binary.

I don't plan to investigate this further at the moment.

No prob. I appreciate that y'all are leaving this open so it can be revisited in the future if anyone wants a crazy test case for compiler performance.
