Cockroach: CockroachDB build on ARMv8 (aarch64, arm64) Ubuntu 16.04 + Go 1.8

Created on 28 Mar 2017  Â·  32Comments  Â·  Source: cockroachdb/cockroach

BUG REPORT

CockroachDB fails to build successfully on ARMv8 (Packet 2A, 96-core Cavium) with Ubuntu 16.04 and Go 1.8.

A full log of the build process is at this gist:

https://gist.github.com/96f20c94fa4a1c1afec98e33c164935a

It looks like there are several intertwined problems, so there might need to be a couple more issues opened to address specifics.

  1. Please describe the issue you observed:
  • What did you do?

Installed Go 1.8 from a Ubuntu PPA, downloaded CockroachDB, adjusted GOPATH and GOROOT, built with "make -j 96".

  • What did you expect to see?

Successful build.

  • What did you see instead?

Several errors during the build process. Specifically in c-jemalloc:

# github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/c-jemalloc
/tmp/ccDvR2dF.s: Assembler messages:
/tmp/ccDvR2dF.s:225: Error: unknown mnemonic `pause' -- `pause'

and multiple errors resulting from a call to clog.go, which is https://github.com/golang/go/issues/11981 (missing Dup2 system call in arm64; use Dup3 instead)

    --- FAIL: TestStyle/TestReturnCheck (6.85s)
        style_test.go:590: err=exit status 1, stderr=/root/src/github.com/cockroac
hdb/cockroach/pkg/util/log/clog.go:926:13: Dup2 not declared by package syscall
B-unsupported-arch C-question O-community

All 32 comments

Building with make TAGS=stdmalloc will disable c-jemalloc and might allow the build to get a little further.

The Dup2 issue is weird; why isn't the posix-standard Dup2 available? Dup3 is linux-specific. It's straightforward to work around this with build tags (which we're going to have to do for windows anyway: #14385), but it would be nice to not have a third variant of this code.

Making progress -

Patched the Dup2 -> Dup3 issue, that needs to be chased again upstream.

Some issues with kv:

# github.com/cockroachdb/cockroach/pkg/kv
pkg/kv/txn_coord_sender.go:1002: offset out of range: 393
00164 (/root/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:344)MOVD     "".txnID+33(FP), R0
pkg/kv/txn_coord_sender.go:1002: offset out of range: 401
00168 (/root/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:344)MOVD     "".txnID+41(FP), R1
# github.com/cockroachdb/cockroach/pkg/kv
pkg/kv/txn_coord_sender.go:1002: offset out of range: 393
00164 (/root/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:344)MOVD     "".txnID+33(FP), R0
pkg/kv/txn_coord_sender.go:1002: offset out of range: 401
00168 (/root/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:344)MOVD     "".txnID+41(FP), R1

and then this final style issue in TestVet, which looks unrelated to the architecture:

    --- FAIL: TestStyle/TestVet (22.06s)
        style_test.go:464: ccl/storageccl/writebatch_test.go:77: declaration of "r
esult" shadows declaration at ccl/storageccl/writebatch_test.go:75
        style_test.go:464: ccl/storageccl/writebatch_test.go:94: declaration of "r
esult" shadows declaration at ccl/storageccl/writebatch_test.go:92
        style_test.go:464: ccl/storageccl/writebatch_test.go:100: declaration of "
result" shadows declaration at ccl/storageccl/writebatch_test.go:98
        style_test.go:464: server/admin.go:609: declaration of "client" shadows de
claration at server/admin.go:39
        style_test.go:464: server/admin.go:1015: declaration of "client" shadows d
eclaration at server/admin.go:39
        style_test.go:464: server/admin_test.go:782: declaration of "e" shadows de
claration at server/admin_test.go:776
        style_test.go:464: server/server_test.go:182: declaration of "client" shad
ows declaration at server/server_test.go:35
        style_test.go:464: server/status_test.go:219: declaration of "wrapper" sha
dows declaration at server/status_test.go:208

I suspect the kv issue you're seeing is https://github.com/golang/go/issues/19137. It sounds like the fix was cherry-picked onto the Go 1.8 release branch recently. Any chance the version of Go 1.8 you're using is from before the cherry-pick?

The TestVet issues are strange (though they won't block the build or anything). @tamird, do you have any idea why we'd be seeing a more strict version of go vet? It looks to be counting a few extra cases in its shadowed variable check.

The version of Go 1.8 I have is from Ubuntu's "Go Language Gophers" team, https://launchpad.net/~gophers/+archive/ubuntu/archive . I'll confer with @mwhudson and see if I can get a more recent build of go tip in that collection.

You may just want to wait for Go 1.8.1.

On Tue, Mar 28, 2017 at 8:52 AM, Edward Vielmetti notifications@github.com
wrote:

The version of Go 1.8 I have is from Ubuntu's "Go Language Gophers" team,
https://launchpad.net/~gophers/+archive/ubuntu/archive
https://launchpad.net/%7Egophers/+archive/ubuntu/archive . I'll confer
with @mwhudson https://github.com/mwhudson and see if I can get a more
recent build of go tip in that collection.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/cockroachdb/cockroach/issues/14405#issuecomment-289759733,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABdsPJksYZp8vgyb8uGsA9gx60t_tiQVks5rqQKXgaJpZM4MrFtY
.

@vielmetti can you try this again? I believe it should work for you now.

Thanks @tamird setting up a build system now.

@vielmetti any luck?

OK, some reports.

make TAGS=stdmalloc builds just fine.

I get this error c++: error: unrecognized command line option ‘-msse4.2’ when I try to do make test. It looks like it's coming from

c-deps/rocksdb-0013-cmake-msse.patch:+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -msse4.2")

which look like @tamird added with this patch

Author: Tamir Duberstein <[email protected]>
Date:   Thu Apr 13 20:09:41 2017 -0400

    CMake: add support for SSE4.2

I think I've fixed that in the past day - if you pull a commit later than https://github.com/cockroachdb/cockroach/commit/9b5de83b0652e14786259e5b76011060a5fd0cb7 and try make build, things should work.

TAGS=stdmalloc is no longer a thing, though.

OK, I looked at requirements more carefully and ran across this:

A C++ compiler that supports C++11. Note that GCC prior to 6.0 does not work due to this issue.
~/go/src/github.com/cockroachdb/cockroach# g++ --version
g++ (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.

I'll track down a more recent vintage GCC for this machine.

I think you're fine with that GCC version, just give it another try with the more recent sources. -msse4.2 doesn't make sense on your architecture anyway.

SSE 4.2 is an intel/amd64 thing; this thread is about ARM. We shouldn't add -msse4.2 on other architectures.

yeah, I think if the CMakefile has a conditional that doesn't do -msse4.2 (but instead in some future time, supports whatever ARM assembler is needed to take advantage of acceleration in the architecture) that we'll be fine.

As I said earlier, if you build with https://github.com/cockroachdb/cockroach/commit/9b5de83b0652e14786259e5b76011060a5fd0cb7 or later, we already omit -msse4.2.

@vielmetti could you try again with latest master?

Build of latest master success, but a test fails; here's a gist:

https://gist.github.com/b66b025920e0fe3ea493a0452fd45fb3

The gist of it seems to be this failure:

logic_test.go:1623: 
                testdata/logic_test/builtin_function:495: 
                expected:
                    0.36787944117144233 2.718281828459045 7.3890560989306502272

                but found (query options: "") :
                    0.36787944117144233  2.7182818284590455  7.3890560989306502
272  

but you'll want to look at the whole thing to see what's really going on. This is replicable.

paging @mjibson, even though this looks like an inconsistency in Go's float implementation?

This is almost certainly caused by a slight difference in arm64 and amd64. We probably need to change that test to round(exp(1.0::float), 13) to make the test pass. It is also likely many other tests after that line will fail, so we may need a real arm device to test on.

@mjibson - I am happy to provide access to a real ARM device for you for test (Packet Type 2A 96-core machine).

Please do! If you could set up ssh access for us, that would be great. Our Github accounts include our public keys, which you can retrieve via the Github API. For example, to get my public key:

curl -sfSL https://api.github.com/users/tamird/keys | python -c 'import sys, json; print json.load(sys.stdin)[0]["key"]'

Check your Gmail inboxes for invites, when you upload your public keys to Packet I can push them into the server.

Uploaded my SSH key!

On Thu, Apr 27, 2017 at 3:19 PM, Edward Vielmetti notifications@github.com
wrote:

Check your Gmail inboxes for invites, when you upload your public keys to
Packet I can push them into the server.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/cockroachdb/cockroach/issues/14405#issuecomment-297813042,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABdsPM1NK0U2jt_WBTnADdPGRQ75fH69ks5r0Oo2gaJpZM4MrFtY
.

@tamird - when you pin down the float inconsistency, let's open up a bug in Go. I have had other reports of exp() returning results that were not the same as the Intel results on ARMv8, so I'm looking for something like a minimal replication that I can track upstream.

Sadly this is not going to be a quick fix. Take the following program:

package main

func main() {
    x := float64(-1)
    println(uint64(x))
}

In amd64, this prints 18446744073709551615. On the ARMv8, it prints 0. This was uncovered by a test in the round function. Thus CockroachDB depends on some either out-of-spec or architecture-dependent behavior. We could (and perhaps should) fix this to be more platform-independent. I'm a little worried that even if we do fix the problems present in these tests there may be other behavior like this.

@mjibson Uh, that's bizarre. The Go spec says that conversion from a floating point to an integer will discard the fractional part (truncation towards zero).

I was able to fix the rounding issue (https://github.com/mjibson/cockroach/tree/arm64). Here's some currently failing tests:

--- FAIL: TestConcurrentBatch (30.43s)
    rocksdb_test.go:416: write took 26.2s

=== RUN   TestShowCreateView
panic: test timed out after 4m0s

goroutine 1066646 [running]:
testing.startAlarm.func1()
    /usr/lib/go-1.8/src/testing/testing.go:1023 +0xd0
created by time.goFunc
    /usr/lib/go-1.8/src/time/sleep.go:170 +0x38

goroutine 1 [chan receive]:
testing.(*T).Run(0x44200e6ea0, 0x16ccc31, 0x12, 0x178a7f8, 0x1)
    /usr/lib/go-1.8/src/testing/testing.go:698 +0x260
testing.runTests.func1(0x44200e6ea0)
    /usr/lib/go-1.8/src/testing/testing.go:882 +0x58
testing.tRunner(0x44200e6ea0, 0x4420671da0)
    /usr/lib/go-1.8/src/testing/testing.go:657 +0x84
testing.runTests(0x44202aab60, 0x212f4a0, 0x93, 0x93, 0xed09ec2b1)
    /usr/lib/go-1.8/src/testing/testing.go:888 +0x260
testing.(*M).Run(0x4420f61f28, 0x25c1d40)
    /usr/lib/go-1.8/src/testing/testing.go:822 +0xe0
github.com/cockroachdb/cockroach/pkg/sql_test.TestMain(0x4420671f28)
    /root/go/src/github.com/cockroachdb/cockroach/pkg/sql/main_test.go:228 +0x110
main.main()
    github.com/cockroachdb/cockroach/pkg/sql/_test/_testmain.go:362 +0xe0

goroutine 17 [syscall, 4 minutes, locked to thread]:
runtime.goexit()
    /usr/lib/go-1.8/src/runtime/asm_arm64.s:981 +0x4

goroutine 97 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).flushDaemon(0x25864c0)
    /root/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1024 +0x50
created by github.com/cockroachdb/cockroach/pkg/util/log.init.1
    /root/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:566 +0xd4

goroutine 130 [syscall, 4 minutes]:
os/signal.signal_recv(0x0)
    /usr/lib/go-1.8/src/runtime/sigqueue.go:116 +0xd8
os/signal.loop()
    /usr/lib/go-1.8/src/os/signal/signal_unix.go:22 +0x18
created by os/signal.init.1
    /usr/lib/go-1.8/src/os/signal/signal_unix.go:28 +0x30

goroutine 595 [select, locked to thread]:
runtime.gopark(0x178c8b8, 0x0, 0x16bb1f5, 0x6, 0x18, 0x2)
    /usr/lib/go-1.8/src/runtime/proc.go:271 +0x100
runtime.selectgoImpl(0x442048b750, 0x18, 0x2)
    /usr/lib/go-1.8/src/runtime/select.go:423 +0xec0
runtime.selectgo(0x442048b750)
    /usr/lib/go-1.8/src/runtime/select.go:238 +0x14
runtime.ensureSigM.func1()
    /usr/lib/go-1.8/src/runtime/signal_unix.go:434 +0x2ec
runtime.goexit()
    /usr/lib/go-1.8/src/runtime/asm_arm64.s:981 +0x4

goroutine 1008105 [runnable]:
strings.Contains(0x445317a794, 0xe8, 0x16c5268, 0xd, 0x0)
    /usr/lib/go-1.8/src/strings/strings.go:95 +0x58
github.com/cockroachdb/cockroach/pkg/util/leaktest.interestingGoroutines(0xed09ec3a1)
    /root/go/src/github.com/cockroachdb/cockroach/pkg/util/leaktest/leaktest.go:41 +0x1e4
github.com/cockroachdb/cockroach/pkg/util/leaktest.AfterTest.func1()
    /root/go/src/github.com/cockroachdb/cockroach/pkg/util/leaktest/leaktest.go:76 +0xc0
github.com/cockroachdb/cockroach/pkg/sql_test.TestShowCreateView(0x4438aea000)
    /root/go/src/github.com/cockroachdb/cockroach/pkg/sql/show_test.go:284 +0xbb8
testing.tRunner(0x4438aea000, 0x178a7f8)
    /usr/lib/go-1.8/src/testing/testing.go:657 +0x84
created by testing.(*T).Run
    /usr/lib/go-1.8/src/testing/testing.go:697 +0x240
exit status 2

I wonder if the TestConcurrentBatch issues are related in some way to https://github.com/cockroachdb/cockroach/issues/12871 which had some extensive discussion.

The response back on the rounding issue (from Ian Lance Taylor, to golang-nuts) reads like this:

For the record, the spec says, in
https://golang.org/ref/spec#Conversions: "In all non-constant
conversions involving floating-point or complex values, if the result
type cannot represent the value the conversion succeeds but the result
value is implementation-dependent."  That is the case that applies
here: you are converting a negative floating point number to uint64,
which can not represent a negative value, so the result is
implementation-dependent.  The conversion to int64 works, of course.
And the conversion to int64 and then to uint64 succeeds in converting
to int64, and when converting to uint64 follows a different rule:
"When converting between integer types, if the value is a signed
integer, it is sign extended to implicit infinite precision; otherwise
it is zero extended. It is then truncated to fit in the result type's
size."

So, basically, don't convert a negative floating point number to an
unsigned integer type.

I'm going to write a lint to make sure we don't do that.

@mjibson with #15864 applied I have all tests passing on arm64 (except for TestConcurrentBatch) - the issue was just that it's too slow to fit in under our default timeout.

TestConcurrentBatch does require additional investigation, but I think it's high time we gave this issue a close (since #15864 is test-only).

@vielmetti thank you for the report, and for the server!

Opened a new issue for TestConcurrentBatch: #15868.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

nvanbenschoten picture nvanbenschoten  Â·  3Comments

bdarnell picture bdarnell  Â·  4Comments

xudongzheng picture xudongzheng  Â·  3Comments

mjibson picture mjibson  Â·  3Comments

jordanlewis picture jordanlewis  Â·  4Comments