Bazel: 0.8.x and 0.9.x release series breaking rules_webtesting Go code in Travis CI

Created on 15 Dec 2017 · 45 Comments · Source: bazelbuild/bazel

On Travis CI's Linux machines (several different versions of them), using a 0.8.x version of bazel causes immediate test failures of any Go code using rules_webtesting's API.

It works fine on macOS and, I bet, on non-containerized Linux machines.

There is a reproduction repo available at https://github.com/jmhodges/bazel_bugs/tree/webtesting_08 (that's the webtesting_08 branch of my bazel_bugs repo). You can test it by running bazel test //foo/... locally vs running it in Travis CI by forking and pushing to a branch.

The 0.7.0, 0.6.1, and 0.6.0 versions of bazel all work correctly.

More discussion happened on rules_webtesting's ticket, but it seems probable that it's a bazel issue, proper.

sandboxing bug

All 45 comments

So, you're saying that bazel behaves differently, depending on whether it runs in a container or on a plain machine? That indeed seems to be related to sandboxing.

Assigning @philwo for further triage.

I can reproduce this on my workstation with Bazel 0.8.1. Thanks for the excellent repro case and instructions, @jmhodges!

I'll run git bisect on Bazel to figure out what caused the regression.

The culprit is 0ebb3e54fc890946ae6b3d059ecbd50e4b5ec840.

There was supposed to be a rollback for this change, but this still fails at HEAD for me, so something weird is going on. Investigating.

I think there might be two unrelated breakages going on, which confused the bisect algorithm: One breakage causes the test to fail after 3 minutes (this is the one that we're seeing now) and the above mentioned culprit caused the test to never finish. I'll have to untangle this and run another bisect...

Oh great! (I imagine that means it's busted on all linux machines? Wild!)

Er, I mean, it's great that you can repro and that you're looking! Thanks for that!

The real culprit for this breakage is cfccdf1f6e93125d894ff40e0ccecaf20cc20ef5 (FYI @laszlocsomor).

The change adds a unique TMPDIR environment variable to each running action / test. Before the change, that variable would not be set at all. After the change, the variable points to an empty directory automatically created / deleted by Bazel for each action somewhere below the execution root. I don't think the change is to blame here - this behavior looks fine to me.

Does this help? It sounds like the failing test might somehow react to TMPDIR suddenly being set, while some other code doesn't and uses a hard-coded /tmp path and then maybe the one creates a socket and the other doesn't fit, or something?

One question to @laszlocsomor: Why do we set TEST_TMPDIR to a different path than TMPDIR? Sounds like those might (should?) be the same? On my system, I'm seeing these, for example, for a simple sh_test with sandboxing disabled:

TEST_TMPDIR=/usr/local/google/home/philwo/.cache/bazel/_bazel_but_philwo/00d545a591fa2eba613af7748fffc2a9/execroot/bazel_bugs/_tmp/c7f21797a8c0a2ee58c25417dcb4aa7a

TMPDIR=/usr/local/google/home/philwo/.cache/bazel/_bazel_but_philwo/00d545a591fa2eba613af7748fffc2a9/execroot/bazel_bugs/local-spawn-runner.3119753370860597927/work
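
To check which values a test actually receives, a small script along these lines could serve as the body of an sh_test. This is only a sketch: the function name report_tmp_vars is made up for illustration, and it prints "unset" when a variable is missing from the environment.

```shell
#!/bin/sh
# Hypothetical helper (not part of Bazel): print the two temp-dir variables
# discussed above, or "unset" when a variable is absent from the environment.
report_tmp_vars() {
  printf 'TMPDIR=%s\n' "${TMPDIR-unset}"
  printf 'TEST_TMPDIR=%s\n' "${TEST_TMPDIR-unset}"
}

report_tmp_vars
```

Comparing the two printed paths shows directly whether the test sees one temp directory or two.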

@philwo :

Why do we set TEST_TMPDIR to a different path than TMPDIR?

I don't know of any reason that'd require that. The reason they are different is probably that I didn't want to mess with TEST_TMPDIR within the same commit, but I can't remember anymore. I think they should be the same.

Hm, are you sure that was the commit? I built a bazel from 573a47ad80eb3553566087180d3f02149a2dc9f4 which is the commit before cfccdf1f6e93125d894ff40e0ccecaf20cc20ef5 and pushed it into this build: https://travis-ci.org/jmhodges/bazel_bugs/jobs/318431517

It's now taking so long to fail (16 minutes plus) that bazel seems to be timing it out. I reproduced this behavior locally, too. The 0.8.x build failures usually took about 4 to 5 minutes.

Is 573a47ad80eb3553566087180d3f02149a2dc9f4 buggy? Is there a known good commit you know to try out?

Here's the Travis CI log file. Travis kept serving a truncated form to me way after it had failed, so you can use this link in case the log above is busted.

log.txt

@jmhodges : Can you try using Bazel built at https://github.com/bazelbuild/bazel/commit/cfccdf1f6e93125d894ff40e0ccecaf20cc20ef5 too?
If that behaves differently (e.g. fails faster) than https://github.com/bazelbuild/bazel/commit/573a47ad80eb3553566087180d3f02149a2dc9f4, then the former change also tickles something in the test.

From the log:

2017/12/19 04:05:35 launching HTTP server at: travis-job-jmhodges-bazel-bugs-318431517.travisci.net:37833
Starting ChromeDriver 2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4) on port 36509
[68 / 69] Still waiting for 1 job to complete:
      Running (local):
        Testing //foo:test_chromium-native, 460 s
s:map[chromeOptions:map[args:[--no-sandbox] binary:/home/travis/.cache/bazel/_bazel_travis/15d67762a08970724949112c87b4af55/execroot/bazel_bugs/_tmp/1f6a4a29ee8e3391c9b6c51a7dc6053a/chrome-linux.zip560944594/chrome-linux/chrome] browserName:chrome] Always:map[chromeOptions:map[args:[--no-sandbox] binary:/home/travis/.cache/bazel/_bazel_travis/15d67762a08970724949112c87b4af55/execroot/bazel_bugs/_tmp/1f6a4a29ee8e3391c9b6c51a7dc6053a/chrome-linux.zip560944594/chrome-linux/chrome] browserName:chrome] First:[]}
PASS
Terminated
================================================================================

I see that the test started an HTTP server, then waited at least 460s (7m 40s), then seemingly passed. I'm not sure what printed the "PASS" and "Terminated" messages, or when. Do those come from the test script? If so, could you change the script to print timestamps?

If you have control over the flags Travis passes to Bazel, then --color=no --curses=no --show_timestamps would be nice to add, as they'd improve test log readability.
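
For a plain command-line run, the suggested flags could be passed as below. This is a sketch: the //foo/... target pattern comes from the repro repo, the variable name is arbitrary, and how Travis is configured to invoke Bazel is not shown in this thread.

```shell
# Flags suggested above for more readable logs.
BAZEL_LOG_FLAGS="--color=no --curses=no --show_timestamps"

# Print the full command; drop the leading "echo" to actually run it.
echo bazel test $BAZEL_LOG_FLAGS //foo/...
```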

I’m 100% sure this is the culprit, but you can’t easily see it by just going back to the commit before that, because there was another overlapping breakage that was later rolled back (the culprit and rollback I mentioned first in my research).

The first breakage causes a “test never finishes” problem, but that was already rolled back. The TMPDIR change causes the “test fails after 180 seconds” breakage.

That would make sense! What would be a good way to repro?

I've currently got the two builds going but it looks like they are going to fail in the same way.

Before: https://travis-ci.org/jmhodges/bazel_bugs/builds/318523927

After: https://travis-ci.org/jmhodges/bazel_bugs/builds/318527129

I'm also down for trying out a patch against HEAD or 0.8.x! I tried a simple revert, saw that it conflicted, and didn't dig in to fix the conflicts.

Oh, also, I was wrong! One totally hit FAIL and the other hit PASS (and both timed out).

(New release came out but has the same bug, so renamed the ticket)

It is really not clear to me that this is a bug in Bazel - if something breaks because TMPDIR now is set to a valid directory (as opposed to not being set before the change), then maybe the thing that breaks is somehow not dealing with that situation correctly? Maybe some part of the code respects $TMPDIR and some other part not and now they no longer look into the same directory?

I would advise debugging whatever rules_webtesting does that breaks. For me that whole thing is a blackbox - I'm not familiar with webtesting and have no idea what it does or how to debug it. For someone familiar with it though, maybe it's easier to see what exactly is failing in these 180 seconds?

@laszlocsomor Can you make sure that the CL that was identified as the culprit does not accidentally do other things that might break rules_webtesting? I had a look at it, but couldn't see anything suspicious.

@philwo : thanks for following up on this bug.
@jmhodges : can you help us repro this issue locally? I just checked out https://github.com/bazelbuild/rules_webtesting and built //... with Bazel 0.9.0 on Linux without any problems:

[13:59:22]-[laszlocsomor@stueck1]-[~/gitroot/rules_webtesting]
  $ bazel version
.........
Build label: 0.9.0
Build target: bazel-out/k8-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Dec 19 09:31:58 2017 (1513675918)
Build timestamp: 1513675918
Build timestamp as int: 1513675918

[13:59:25]-[laszlocsomor@stueck1]-[~/gitroot/rules_webtesting]
  $ bazel --bazelrc=/dev/null build //...
INFO: Analysed 104 targets (108 packages loaded).
INFO: Found 104 targets...
INFO: Elapsed time: 7.802s, Critical Path: 1.79s
INFO: Build completed successfully, 22 total actions

[13:59:41]-[laszlocsomor@stueck1]-[~/gitroot/rules_webtesting]
  $ git rev-parse HEAD
d9009f881c19d3da520943955d9df80b492e9235

Never mind, I had to test, not build, the targets. I see failures now.

Repro:

$ git clone https://github.com/jmhodges/bazel_bugs.git gh4303
$ cd gh4303
$ git checkout origin/webtesting_08
$ bazel test //foo/...

I see the same errors as @jmhodges linked in his comment earlier: https://github.com/bazelbuild/bazel/issues/4303#issuecomment-352636540

Following @philwo's hunch, I looked at which rules use TMP explicitly:

  $ find $(bazel info output_base)/external -name '*.bzl' | xargs grep -m1 "TMP" | sed 's,^.*\(external/[^:]*\):.*$,\1,'
external/io_bazel_rules_go/go/private/repository_tools.bzl
external/io_bazel_rules_go/go/private/toolchain.bzl
external/io_bazel_rules_go/go/private/go_toolchain.bzl

The WORKSPACE file pulls these from https://github.com/bazelbuild/rules_go/commit/b2a59d8140f33174ca9cbac2cf5ab0bf0997826c.

That commit is old, so updating the WORKSPACE to pull a newer one might be a good idea. Current head is https://github.com/bazelbuild/rules_go/commit/8e0eef1c3e42c6ccd832cff19a6a8d2ae496935e; alas, the test still fails with that.

After updating that commit hash, these files contain references to TMP:

  $ find $(bazel info output_base)/external -name '*.bzl' | xargs grep -m1 "TMP" | sed 's,^.*\(external/[^:]*\):.*$,\1,'
external/io_bazel_rules_go/tests/bazel_tests.bzl
external/io_bazel_rules_go/go/private/repository_tools.bzl
external/io_bazel_rules_go/go/private/common.bzl

I'm still not sure why https://github.com/bazelbuild/bazel/commit/cfccdf1f6e93125d894ff40e0ccecaf20cc20ef5 could have broken the Go rules. The difference before and after that change is:

  • Bazel will set a TMPDIR envvar for every Spawn, which I believe includes repository_ctx.execute actions.
  • TMPDIR will point to a unique, empty directory under $(bazel info execution_root). It will not be the same as TMP or the user's TMPDIR envvar, no matter what the action requests.

Sounds like we'd need a rules_go person to comment?

I guess I could cc @ianthehat and @jayconrod myself.

I reproduced the failure, but I don't see anything that indicates this is a rules_go issue. The test binary builds and links successfully, but it fails when it runs because it can't start Chrome.

Our remaining uses of TMP and TMPDIR are in repository rules, but we're just passing those through to commands, not explicitly using them for anything.

FAIL: //foo:test_chromium-native (see /usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.log)
____From Testing //foo:test_chromium-native:
==================== Test output for //foo:test_chromium-native:
GTEST_TMP_DIR=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/_tmp/b66c9f9666e4f6a7df28a5d0483bd35b
TEST_INFRASTRUCTURE_FAILURE_FILE=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.infrastructure_failure
TEST_SRCDIR=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/bin/foo/test_chromium-native.runfiles
RUNFILES_MANIFEST_FILE=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/bin/foo/test_chromium-native.runfiles/MANIFEST
WEB_TEST_METADATA=bazel_bugs/foo/test_chromium-native.gen.json
JAVA_RUNFILES=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/bin/foo/test_chromium-native.runfiles
TMPDIR=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/tmp25b_e2489700cdf06e02
TEST_UNUSED_RUNFILES_LOG_FILE=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.unused_runfiles_log
TEST_LOGSPLITTER_OUTPUT_FILE=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.raw_splitlogs/test.splitlogs
TEST_BINARY=foo/test_chromium-native
USER=jayconrod
TEST_UNDECLARED_OUTPUTS_DIR=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.outputs
RUNFILES_DIR=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/bin/foo/test_chromium-native.runfiles
TEST_TIMEOUT=900
PATH=.:/usr/local/google/users/jayconrod/aswb-sdk/tools:/usr/local/google/users/jayconrod/aswb-sdk/platform-tools:/usr/local/google/home/jayconrod/Code/go-1.9.2/bin:/usr/local/google/home/jayconrod/.go/bin:/usr/local/google/home/jayconrod/bin:/usr/local/google/users/jayconrod/aswb-sdk/tools:/usr/local/google/users/jayconrod/aswb-sdk/platform-tools:/usr/local/google/home/jayconrod/Code/go-1.9.2/bin:/usr/local/google/home/jayconrod/.go/bin:/usr/local/google/home/jayconrod/bin:/usr/lib/google-golang/bin:/usr/local/buildtools/java/jdk/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
TEST_TEMPDIR=test_tempdir.hQSaS1
PWD=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/bin/foo/test_chromium-native.runfiles/bazel_bugs
TEST_WARNINGS_OUTPUT_FILE=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.warnings
TZ=UTC
TEST_UNDECLARED_OUTPUTS_ANNOTATIONS_DIR=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.outputs_manifest
SHLVL=2
RUN_UNDER_RUNFILES=1
TEST_SIZE=large
TEST_TMPDIR=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/_tmp/b66c9f9666e4f6a7df28a5d0483bd35b
TEST_WORKSPACE=bazel_bugs
XML_OUTPUT_FILE=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.xml
PYTHON_RUNFILES=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/bin/foo/test_chromium-native.runfiles
TEST_PREMATURE_EXIT_FILE=/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/bazel-out/k8-fastbuild/testlogs/foo/test_chromium-native/test.exited_prematurely
BASH_FUNC_rlocation%%=() {  if is_absolute "$1"; then
 echo "$1";
 else
 echo "$(dirname $RUNFILES_MANIFEST_FILE)/$1";
 fi
}
BASH_FUNC_is_absolute%%=() {  [[ "$1" = /* ]] || [[ "$1" =~ ^[a-zA-Z]:[/\\].* ]]
}
_=/usr/bin/printenv
2017/12/21 18:37:33 launching HTTP server at: jayconrod1.nyc.corp.google.com:37224
2017/12/21 18:37:34 Creating session

Starting ChromeDriver 2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4) on port 43494
Only local connections are allowed.
2017/12/21 18:37:37 Caps: {OSSCaps:map[chromeOptions:map[args:[--no-sandbox] binary:/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/_tmp/b66c9f9666e4f6a7df28a5d0483bd35b/chrome-linux.zip380330947/chrome-linux/chrome] browserName:chrome] Always:map[browserName:chrome chromeOptions:map[args:[--no-sandbox] binary:/usr/local/google/home/jayconrod/.cache/bazel/_bazel_jayconrod/d0d5517dc897dbae499c450d81dc45be/execroot/bazel_bugs/_tmp/b66c9f9666e4f6a7df28a5d0483bd35b/chrome-linux.zip380330947/chrome-linux/chrome]] First:[]}
--- FAIL: TestStart (183.20s)
    wd_test.go:14: error starting webdriver: session not created: [Go WebDriver Client] (unknown error) unknown error: Chrome failed to start: exited abnormally
          (Driver info: chromedriver=2.33.506092 (733a02544d189eeb751fe0d7ddca79a0ee28cce4),platform=Linux 4.4.0-97-generic x86_64)
FAIL
2017/12/21 18:40:38 test failed exit status 1

Okay, since the rules_webtesting folks haven't chimed in, I forked it and added some logging from chromedriver to my tests. The branches webtesting_08_logging and webtesting_07_logging exist on the origin now.

I plopped the logs from webtesting_08_logging to stdout in travis (webtesting_08_logging build vs webtesting_07_logging build) and here's the most interesting section:

[3.608][DEBUG]: DevTools request: http://localhost:12793/json/version
[3.608][DEBUG]: DevTools request failed
[3.658][DEBUG]: DevTools request: http://localhost:12793/json/version
[3.659][DEBUG]: DevTools request failed
[3.709][DEBUG]: DevTools request: http://localhost:12793/json/version
[3.710][DEBUG]: DevTools request failed
[5894:5894:1222/081804.705861:FATAL:process_singleton_posix.cc(247)] Check failed: SetupSockAddr(path, addr). Socket path too long: /home/travis/.cache/bazel/_bazel_travis/15d67762a08970724949112c87b4af55/execroot/bazel_bugs/tmp247_40517a5493408639/.org.chromium.Chromium.0U8M3K/SingletonSocket
#0 0x5630eb919a87 base::debug::StackTrace::StackTrace()
#1 0x5630eb92d3d1 logging::LogMessage::~LogMessage()
#2 0x5630eb878118 ProcessSingleton::Create()
#3 0x5630eb877d65 ProcessSingleton::NotifyOtherProcessWithTimeoutOrCreate()
#4 0x5630eb877cab ProcessSingleton::NotifyOtherProcessOrCreate()
#5 0x5630eb8267db ChromeBrowserMainParts::PreMainMessageLoopRunImpl()
#6 0x5630eb82649a ChromeBrowserMainParts::PreMainMessageLoopRun()
#7 0x5630ea7e119d content::BrowserMainLoop::PreMainMessageLoopRun()
#8 0x5630eaaebf87 content::StartupTaskRunner::RunAllTasksNow()
#9 0x5630ea7df5dd content::BrowserMainLoop::CreateStartupTasks()
#10 0x5630ea7e3bbc content::BrowserMainRunnerImpl::Initialize()
#11 0x5630ea7dccb2 content::BrowserMain()
#12 0x5630eb62b4ed content::ContentMainRunnerImpl::Run()
#13 0x5630eb6334d7 service_manager::Main()
#14 0x5630eb62a192 content::ContentMain()
#15 0x5630ea2901ac ChromeMain
#16 0x7f09e3949f45 __libc_start_main
#17 0x5630ea290010 <unknown>
Received signal 6
#0 0x5630eb919a87 base::debug::StackTrace::StackTrace()
#1 0x5630eb9195ef base::debug::(anonymous namespace)::StackDumpSignalHandler()
#2 0x7f09e3d01330 <unknown>
#3 0x7f09e395ec37 gsignal
#4 0x7f09e3962028 abort
#5 0x5630eb918482 base::debug::BreakDebugger()
#6 0x5630eb92d893 libGL error: failed to load driver: swrast
Received signal 11 SEGV_MAPERR 0000000000b8
#0 0x55c14aadda87 logging::LogMessage::~LogMessage()
#7 0x5630eb878118 base::debug::StackTrace::StackTrace()
#1 0x55c14aadd5ef base::debug::(anonymous namespace)::StackDumpSignalHandler()
#2 0x7f178a88d330 <unknown>
#3 0x7f177dd07686 <unknown>
#4 0x55c14b59cb54 gl::(anonymous namespace)::CreateDummyWindow()
#5 0x55c14b59c98b ProcessSingleton::Create()
#8 0x5630eb877d65 ProcessSingleton::NotifyOtherProcessWithTimeoutOrCreate()
#9 0x5630eb877cab gl::GLSurfaceGLX::InitializeOneOff()
#6 0x55c14b5cd10d ProcessSingleton::NotifyOtherProcessOrCreate()
#10 0x5630eb8267db gl::init::InitializeGLOneOffPlatform()
#7 0x55c14b5cc4c9 [3.760][DEBUG]: DevTools request: http://localhost:12793/json/version
[3.760][DEBUG]: DevTools request failed
ChromeBrowserMainParts::PreMainMessageLoopRunImpl()
#11 0x5630eb82649a gl::init::InitializeGLOneOffImplementation()
#8 0x55c14b5cc2ef ChromeBrowserMainParts::PreMainMessageLoopRun()
#12 0x5630ea7e119d content::BrowserMainLoop::PreMainMessageLoopRun()
#13 0x5630eaaebf87 gl::init::InitializeGLOneOff()
#9 0x55c14b5e0a57 content::StartupTaskRunner::RunAllTasksNow()
#14 0x5630ea7df5dd content::BrowserMainLoop::CreateStartupTasks()
#15 0x5630ea7e3bbc gpu::GpuInit::InitializeAndStartSandbox()
#10 0x55c14ce4c4a0 content::BrowserMainRunnerImpl::Initialize()
#16 0x5630ea7dccb2 content::BrowserMain()
#17 0x5630eb62b4ed content::ContentMainRunnerImpl::Run()
#18 0x5630eb6334d7 content::GpuMain()
#11 0x55c14a7ef4ed service_manager::Main()
#19 0x5630eb62a192 content::ContentMainRunnerImpl::Run()
#12 0x55c14a7f74d7 content::ContentMain()
#20 0x5630ea2901ac service_manager::Main()
#13 0x55c14a7ee192 content::ContentMain()
#14 0x55c1494541ac ChromeMain
#21 0x7f09e3949f45 __libc_start_main
#22 0x5630ea290010 ChromeMain
#15 0x7f178a4d5f45 __libc_start_main
#16 0x55c149454010 [3.810][DEBUG]: DevTools request: http://localhost:12793/json/version
[3.811][DEBUG]: DevTools request failed
<unknown>
  r8: ffffb3b7e9c110c0  r9: ffffb3b7e9c110b0 r10: 0000000000000008 r11: 0000000000000206
 r12: 00007ffd80e22e20 r13: 0000000000000126 r14: 00007ffd80e22e18 r15: 00007ffd80e22e10
  di: 0000000000001706  si: 0000000000001706  bp: 00007ffd80e22960  bx: 00007ffd80e22960
  dx: 0000000000000006  ax: 0000000000000000  cx: 00007f09e395ec37  sp: 00007ffd80e227b8
  ip: 00007f09e395ec37 efl: 0000000000000206 cgf: 0000000000000033 erf: 0000000000000004
 trp: 000000000000000e msk: 0000000000000000 cr2: 0000000000000000
[end of stack trace]
Calling _exit(1). Core file will not be generated.
<unknown>
  r8: 0000000000000000  r9: 000055c14ffcb150 r10: 00003ae1e8587f00 r11: 0000000000000000
 r12: 0000000000600002 r13: 0000000000000095 r14: 0000000000000000 r15: 0000000000000000
  di: 00003ae1e85cd800  si: 000000000000001f  bp: 0000000000000000  bx: 00003ae1e85cd800
  dx: 00003ae1e85ec000  ax: 000000000000001f  cx: 0000000000000018  sp: 00007ffdfce05c50
  ip: 00007f177dd07686 efl: 0000000000010206 cgf: 0000000000000033 erf: 0000000000000004
 trp: 000000000000000e msk: 0000000000000000 cr2: 00000000000000b8
[end of stack trace]
Calling _exit(1). Core file will not be generated.

From there it's just

[3.861][DEBUG]: DevTools request: http://localhost:12793/json/version
[3.861][DEBUG]: DevTools request failed

repeated over and over.

There's also some NaCl, swrast, and RANDR error logging but those exist in both logs, so are likely irrelevant. We're left with the Socket path too long error.

Apparently, this is a real limitation of some Linux systems where the max socket path is 107 characters.

So, the patch we've been discussing that creates a new temp directory is also causing the new temp directories to be much longer causing some systems to hit this limitation?
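
The limit can be checked against the path from the Chrome log with a few lines of shell. This is a sketch, assuming the Linux layout where sockaddr_un's sun_path is 108 bytes including the trailing NUL (hence 107 usable); socket_path_fits is a made-up helper name, and the length test counts characters, which equals bytes for ASCII paths like this one.

```shell
# On Linux, sun_path holds 108 bytes including the trailing NUL,
# so a filesystem socket path can be at most 107 bytes long.
MAX_SOCKET_PATH=107

# Hypothetical helper: exit status 0 if the path is short enough for a Unix socket.
socket_path_fits() {
  [ "${#1}" -le "$MAX_SOCKET_PATH" ]
}

# Path copied from the Chrome FATAL log line above.
failing_path="/home/travis/.cache/bazel/_bazel_travis/15d67762a08970724949112c87b4af55/execroot/bazel_bugs/tmp247_40517a5493408639/.org.chromium.Chromium.0U8M3K/SingletonSocket"

if socket_path_fits "$failing_path"; then
  echo "fits: ${#failing_path} characters"
else
  echo "too long: ${#failing_path} characters (limit $MAX_SOCKET_PATH)"
fi
```

Most of the length comes from the output base and execroot prefix, not from the part Chrome appends.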

Apparently, this is a real limitation of some Linux systems where the max socket path is 107 characters.

Ouch, this limitation surfaced recently on another issue too: https://github.com/bazelbuild/bazel/issues/3215#issuecomment-348994701

The thread contains some workaround ideas (https://github.com/bazelbuild/bazel/issues/3215#issuecomment-350004233) and a proposal for a principled solution (https://github.com/bazelbuild/bazel/issues/3215#issuecomment-349932419).

So, the patch we've been discussing that creates a new temp directory is also causing the new temp directories to be much longer causing some systems to hit this limitation?

Yes.

Ah, the workaround mentioned seems to be specific to the JVM test targets and isn't available here in Go world.

I think this per-action directory stuff was designed without an awareness of the 107 char path length problem and maybe it could be redesigned (and the paths shortened?) with that in mind?

It's going to cause a lot more issues than these and the amount of not technically necessary info in the paths is significant.

I think this per-action directory stuff was designed without an awareness of the 107 char path length problem and maybe it could be redesigned (and the paths shortened?) with that in mind?

I designed it and I was indeed unaware of this problem.
As for redesign, I think my proposal for --local_tmp_root in https://github.com/bazelbuild/bazel/issues/3215#issuecomment-349932419 will address the problem for all local actions. (See the comment for details on why it's sufficient to solve it for local actions.)

Which prefix would it replace in the path “/home/travis/.cache/bazel/_bazel_travis/15d67762a08970724949112c87b4af55/execroot/bazel_bugs/tmp247_40517a5493408639/.org.chromium.Chromium.0U8M3K/SingletonSocket”?

It seems not great to expect every Linux (including CI) user to configure this by hand this often.

It seems not great to expect every Linux (including CI) user to configure this by hand this often.

Sadly I have to agree with you. :) :(

Let me see how hard it'd be to change the behavior to respect $TMPDIR and fall back to /tmp if undefined.
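
The fallback described here can be sketched in shell. This illustrates the idea only and is not Bazel's implementation; tmp_root and the action.XXXXXX name prefix are hypothetical.

```shell
# Pick the parent for temp dirs: $TMPDIR if set and non-empty, otherwise /tmp.
tmp_root() {
  printf '%s\n' "${TMPDIR:-/tmp}"
}

# Create a unique directory under that root, report it, then clean up.
action_tmp="$(mktemp -d "$(tmp_root)/action.XXXXXX")"
echo "per-action temp dir: $action_tmp"
rmdir "$action_tmp"
```

With TMPDIR unset, the resulting directory sits directly under /tmp, which leaves far more room under the 107-character limit for whatever socket name a tool appends.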

Sorry Jeff, unfortunately this'll have to wait until January. I have to leave now for Christmas vacation.
But you convinced me, we have to fix this. Happy holidays everyone!

Thank you so much to you and all of you for your patience with me while we figured this out! Happy holidays!

Does this need a fix cherry pick for bazel 0.10.0?

Ooo, from this user's perspective, that would be great!

I just confirmed the patches that closed out #4376 do fix this ticket! https://travis-ci.org/jmhodges/bazel_bugs/builds/329178928

Hey, I've not been able to follow along with the release process. Is this stuff in the 0.10 branch? Should I make another ticket for that?

It seems that last commit (https://github.com/bazelbuild/bazel/commit/2e631c99495f75270d2639542cefb531ec262d67) from #4376 didn't go into the release. We could cherry pick it if necessary.

Can you test with the latest RC to see if you still get this issue? You can try downloading RC4 from https://releases.bazel.build/0.10.0/rc4/index.html.

Done! 0.10rc4 is still busted, unfortunately.

https://travis-ci.org/jmhodges/bazel_bugs/builds/332436644

Thank you for the help!

Btw, it looks like 0.10rc7 does fix this! https://travis-ci.org/jmhodges/bazel_bugs/builds/333987091

@jmhodges Glad that it works now - looks like we cherry-picked the right fix! :)

Thanks so much!

Since bazel 0.10 is out and fixes this, this ticket can be closed. So, I'm doing that.
