Bazel: Workers need to support cancellations (e.g. to honor Ctrl+C)

Created on 17 Nov 2015 · 31 Comments · Source: bazelbuild/bazel

When I hit Ctrl-C, bazel processes still run in the background and print messages.

FreeBSD-10.2

P3 team-Local-Exec bug

Most helpful comment

Now that Angular 9 is on the horizon (which seems to use Bazel under the hood), this issue also popped up for me. I started a build, immediately recognized that a dependency was out of date and attempted to stop via CTRL + C, as was possible before. This was the result:

0% compiling^C^C
Compiling @angular/core : es2015 as esm2015
^C
Compiling @angular/common : es2015 as esm2015
^C^C^C
Compiling @angular/common/http : es2015 as esm2015
^C
Compiling @angular/common/http/testing : es2015 as esm2015
^C^C^C^C^C^C^C^C
Compiling @angular/platform-browser : es2015 as esm2015
^C^C^C
Compiling @angular/router : es2015 as esm2015
^C^C^C^C^C^C
Compiling @angular/cdk/portal : es2015 as esm2015
^C^C^C^C^C^C^C^C^C
Compiling @angular/forms : es2015 as esm2015
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
Compiling @angular/cdk/keycodes : es2015 as esm2015

Compiling @angular/cdk/platform : es2015 as esm2015
^C
Compiling @angular/cdk/coercion : es2015 as esm2015
^C^C^C^C
Compiling @angular/cdk/observers : es2015 as esm2015
^C^C^C^C
Compiling @angular/cdk/a11y : es2015 as esm2015
^C^C^C^C
Compiling @angular/cdk/bidi : es2015 as esm2015
^C^C^C
Compiling @angular/cdk : es2015 as esm2015

Compiling @angular/animations : es2015 as esm2015
^C^C
Compiling @angular/animations/browser : es2015 as esm2015
^C^C^C^C
Compiling @angular/platform-browser/animations : es2015 as esm2015
^C^C^C
Compiling @angular/material/core : es2015 as esm2015

Compiling @angular/cdk/collections : es2015 as esm2015
^C^C^C^C^C^C^C^C^C^C^C
Compiling @angular/cdk/scrolling : es2015 as esm2015
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
Compiling @angular/cdk/overlay : es2015 as esm2015
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
Compiling @angular/material/form-field : es2015 as esm2015
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
Compiling @angular/material/autocomplete : es2015 as esm2015
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
Compiling @angular/material/badge : es2015 as esm2015
^C^C^C^C^C^C^C^C^C^C^C^C^C
Compiling @angular/cdk/layout : es2015 as esm2015
^C^C^C^C^C^C^C^C^C^C^C^C^C^C
Compiling @angular/material/bottom-sheet : es2015 as esm2015

Compiling @angular/material/button : es2015 as esm2015

Compiling @angular/material/button-toggle : es2015 as esm2015

Compiling @angular/material/card : es2015 as esm2015

Compiling @angular/material/checkbox : es2015 as esm2015

Compiling @angular/material/chips : es2015 as esm2015

Compiling @angular/material/dialog : es2015 as esm2015

The process did terminate once the Angular compiler did its thing and Bazel regained control, but I had to wait almost two minutes until that happened. In this scenario it would therefore have been nice if Bazel relayed the CTRL + C to the currently running processes.

All 31 comments

This is working as intended. A single Ctrl+C cancels the run (waiting for the subprocesses to be interrupted), and pressing Ctrl+C three times kills the background server and all the child threads.

Wow, I cannot find documentation for that behavior; it should definitely be documented.

This is certainly not conventional behavior. Usually the parent process propagates the signal to the process group immediately. You can even kill the whole tree in one operation:

kill -- -$PGID
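A minimal sketch of what that propagation could look like from the parent's side, assuming a POSIX system. `terminate_tree` is a hypothetical helper name used only for illustration; it is not taken from Bazel.

```python
import os
import signal
import subprocess
import time


def terminate_tree(proc: subprocess.Popen, grace_seconds: float = 2.0) -> int:
    """Send SIGTERM to the child's whole process group, then SIGKILL if needed."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)  # same idea as: kill -- -PGID
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # SIGKILL cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # start_new_session=True gives the child its own session and process
    # group, so signalling that group does not hit this parent process.
    child = subprocess.Popen(["sleep", "60"], start_new_session=True)
    time.sleep(0.1)
    print("exit status:", terminate_tree(child))
```

The grace period before SIGKILL is a design choice: it gives well-behaved children a chance to clean up, which is the concern raised in the replies that follow.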

Well, Ctrl+C is a SIGTERM, not a SIGKILL, so it shouldn't forcefully kill its children. Forcefully killing a subprocess might lead to various alterations on the file system if the subprocess does not handle SIGTERM nicely.

The default behavior for SIGTERM in all UNIXes is to terminate the process immediately. Child processes should behave the same, because they appear to the user as one compound process. So it makes sense to terminate them immediately too. It is their responsibility to clean up properly on signal.

The particular practical problem with the behavior you described is that I press Ctrl-C, open an editor, and see a whole bunch of messages trash the editor screen. Then I refresh the screen, and some more messages trash it again.

You need to make this configurable, with both behaviors supported. Otherwise you will get a lot of complaints from the UNIX users. I actually observed such behavior several times with some other build systems, and it is very annoying -)

I was wrong. Ctrl+C sends a SIGINT, whose default behavior is to terminate the process, but that is not mandatory (see https://en.m.wikipedia.org/wiki/Unix_signal). Even SIGTERM, unlike SIGKILL, is usually used for a safe exit (clean up, then exit). A process that receives SIGINT is not even required to terminate eventually.
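To illustrate the point that terminating on SIGINT is only the default, here is a small sketch of a process that installs its own handler and decides for itself when to stop. The function names are hypothetical, and the self-sent signal merely stands in for a user pressing Ctrl+C.

```python
import os
import signal
import time

interrupted = False


def on_sigint(signum, frame):
    """Record the interrupt instead of letting the default action end the process."""
    global interrupted
    interrupted = True


def wait_for_interrupt(poll_seconds: float = 0.05) -> str:
    """Keep working until an interrupt has been recorded, then exit cleanly."""
    while not interrupted:
        time.sleep(poll_seconds)
    return "cleaned up after SIGINT"


if __name__ == "__main__":
    signal.signal(signal.SIGINT, on_sigint)
    os.kill(os.getpid(), signal.SIGINT)  # stands in for the user pressing Ctrl+C
    print(wait_for_interrupt())
```

A handler like this could equally well ignore the signal forever, which is why a parent cannot assume its children are gone just because SIGINT was delivered.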

Several programs out there don't terminate on Ctrl+C. A make build usually just terminates the child process and doesn't hand control back to the user.

There is no reason to change Bazel behavior here. But it should be documented.

To clarify, Ctrl-C does try to terminate the build, but it tries to do so cleanly. This means that if the build is in a state where it cannot be immediately terminated, Ctrl-C appears to do nothing (although it will stop the build ASAP). In general, though, a single Ctrl-C will stop a build.

I want to update the Blaze UI to clearly indicate that the Ctrl-C has been received and what it's doing for shutdown. Note that triple-Ctrl-C does indeed kill bazel.
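The escalation described here (early interrupts ask for a clean stop, the third one kills) can be sketched roughly as follows. This is an illustration only, not Bazel's implementation; `InterruptEscalator` and `install` are made-up names.

```python
import signal


class InterruptEscalator:
    """Counts interrupts: early ones request a clean stop, a later one forces exit."""

    def __init__(self, force_after: int = 3):
        self.force_after = force_after
        self.count = 0
        self.shutdown_requested = False
        self.force_kill = False

    def handle(self, signum=None, frame=None) -> str:
        self.count += 1
        if self.count >= self.force_after:
            self.force_kill = True
            return f"interrupt {self.count}: killing immediately"
        self.shutdown_requested = True
        return f"interrupt {self.count}: shutting down cleanly"


def install(escalator: InterruptEscalator) -> None:
    """Route real Ctrl+C presses (SIGINT) through the escalator."""
    signal.signal(signal.SIGINT, lambda s, f: print(escalator.handle(s, f)))


if __name__ == "__main__":
    esc = InterruptEscalator()
    for _ in range(3):
        print(esc.handle())  # simulated key presses
```

Reporting each interrupt back to the user, as the returned messages do, addresses the complaint that a single Ctrl-C can appear to do nothing.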

Now I have the build failing on FreeBSD for a related reason. I restarted the previously failed build, and got this error:

[305 / 399] Still waiting for 3 jobs to complete:
      Running (worker):
        Building src/main/protobuf/libxcodegen_proto.jar (0 files), 984 s
        Building src/main/protobuf/libextra_actions_base_proto.jar (0 files), 984 s
        Building src/main/java/com/google/devtools/build/lib/libbuild-info.jar (1 files), 984 s
.......cp: output/bazel: Text file busy

Creating Bazel self-extracting archive...

"Text file busy" is the error that means that the executable is running. And indeed:

[305 / 399] Still waiting for 3 jobs to complete:
      Running (worker):
        Building src/main/protobuf/libxcodegen_proto.jar (0 files), 1164 s
        Building src/main/protobuf/libextra_actions_base_proto.jar (0 files), 1164 s
        Building src/main/java/com/google/devtools/build/lib/libbuild-info.jar (1 files), 1164 s
[root@yuri /usr/ports/devel/bazel]# 
# ps ax | grep bazel
35290  -  Is       1:34.56 bazel(work) -server -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/root/.cache/bazel/_bazel_yuri/e341cd3cdeac84d112759b4525081f7c -Xverify:no
35745  -  S        0:01.28 external/local-jdk/bin/java -Xbootclasspath/p:external/bazel_tools/third_party/java/jdk/langtools/javac.jar -client -jar bazel-out/host/bin/src/
35746  -  S        0:01.29 external/local-jdk/bin/java -Xbootclasspath/p:external/bazel_tools/third_party/java/jdk/langtools/javac.jar -client -jar bazel-out/host/bin/src/
35747  -  S        0:01.30 external/local-jdk/bin/java -Xbootclasspath/p:external/bazel_tools/third_party/java/jdk/langtools/javac.jar -client -jar bazel-out/host/bin/src/
35748  -  S        3:05.94 external/local-jdk/bin/java -Xbootclasspath/p:external/bazel_tools/third_party/java/jdk/langtools/javac.jar -client -jar bazel-out/host/bin/src/
35270  4  S        0:02.66 /usr/ports/devel/bazel/work/bazel-effa572/output/bazel --nomaster_bazelrc --bazelrc=/dev/null build --singlejar_top=//src/java_tools/singlejar:b
47076  4  S+       0:00.02 grep bazel

bazel from the failed run is still running. This is Exhibit A for why you shouldn't let child processes run after the parent has finished.

It looks like a problem with the worker, summoning @philwo to see if he sees something.

The particular practical problem with the behavior you described is that I press Ctrl-C, open an editor, and see a whole bunch of messages trash the editor screen. Then I refresh the screen, and some more messages trash it again.

I don't see how this can happen for a number of reasons:

  • Due to the behavior described in this thread, "I press Ctrl-C" doesn't immediately terminate Bazel. You don't get your terminal back, until Bazel has actually quit, so you can't launch an editor in the same terminal until Bazel has quit.
  • Now, assume you hammer Ctrl-C until Bazel bails out. Code in process-wrapper.c and JVM shutdown hooks still ensure that all child processes are killed.
  • However, let's assume that for some reason didn't work and child processes keep running. They still can't print anything on your screen though, because Bazel doesn't launch child processes with the real stdout/stderr - we always redirect output through files.

This is Exhibit A for why you shouldn't let child processes run after the parent has finished.

Here's what Bazel does to ensure that child processes get killed:

Please feel free to suggest improvements to this and/or send a patch.

Note that FreeBSD is not an officially supported platform, so while we'd be super happy to see it working and happily merge patches to improve support for it, we don't run our CI on it, we don't use it ourselves and we don't have much experience with it. I'm not able to reproduce this issue on either Linux or OS X, so I can't do anything about it. I tried Ctrl-C once, Ctrl-C three times, kill -9, ... and it all works fine for me:

kill -9:

philwo-macbookpro:bazel philwo$ /Users/philwo/bazel/output/bazel build --strategy=Javac=worker //src/main/...
INFO: Found 129 targets...
INFO: From Building src/main/java/com/google/devtools/common/options/liboptions.jar (18 files) [for host]:
Note: src/main/java/com/google/devtools/common/options/GenericTypeHelper.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Error: unexpected EOF from Bazel server.
Contents of '/private/var/tmp/_bazel_philwo/7ebb4938e2627f7b8c20e7541fec7231/server/jvm.out':

Ctrl-C:

philwo-macbookpro:bazel philwo$ /Users/philwo/bazel/output/bazel build --strategy=Javac=worker //src/main/...
.....
INFO: Found 129 targets...
[350 / 574] Building src/main/protobuf/libbuild_proto_v2.jar (0 files) [for host]
^C
Bazel caught interrupt signal; shutting down.
ERROR: /Users/philwo/bazel/src/main/java/com/google/devtools/build/lib/BUILD:7:1: Java compilation in rule '//src/main/java/com/google/devtools/build/lib:common' failed: java failed: error executing command external/local-jdk/bin/java -Xbootclasspath/p:external/bazel_tools/third_party/java/jdk/langtools/javac.jar -client -jar external/bazel_tools/tools/jdk/JavaBuilder_deploy.jar ... (remaining 1 argument(s) skipped): null.
^C
Bazel caught interrupt signal; shutting down.

^C
Bazel caught third interrupt signal; killed.

philwo-macbookpro:bazel philwo$ 

There were no children running after both experiments.

See this log; I pressed ^C after 1 second or so. But due to the large server-class.jar target, it takes another 15 seconds for bazel to exit.

I would really like for worker processes to be interrupted more forcefully when I press ^C.

hanwen@han-wen:~/vc/gerrit$ bazel test --test_filter=ConsistencyIT --test_env=GERRIT_NOTEDB=ON //javatests/com/google/gerrit/acceptance/api/group/...
INFO: Found 1 test target...
[278 / 344] Building java/com/google/gerrit/server/libserver-class.jar (1339 source files) and running annotation processors (AutoAnnotationProcessor, AutoValueProcessor)
^C
Bazel caught interrupt signal; shutting down.
Target //javatests/com/google/gerrit/acceptance/api/group:api_group failed to build
Use --verbose_failures to see the command lines of failed build steps.
ERROR: build interrupted.
INFO: Elapsed time: 24.552s, Critical Path: 22.10s

Totally agree, @hanwen. The problem is that the current worker protocol has no mechanism to cancel a running action. We could kill the worker when the action is interrupted, but that means that they'll lose all cache and JIT and have to start again the next time.

I guess for canceling entire builds this might be a reasonable trade-off, but for canceling individual actions, for example when we run remote and local execution in parallel, it wouldn't work.

I'll think about it for a day. Maybe we should indeed kill workers with SIGTERM when the build is canceled. :|

You don't really have to kill the worker, actually. It's enough if you don't wait for the worker result, but discard it in a future.
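As a rough sketch of that idea, assuming the worker request is wrapped in a future: the caller stops waiting as soon as the build is cancelled, and the worker finishes in the background with its result dropped. `slow_worker_request` and `run_action` are hypothetical stand-ins, not Bazel code.

```python
import concurrent.futures
import threading
import time


def slow_worker_request(seconds: float) -> str:
    """Stands in for a worker request that cannot be cancelled once started."""
    time.sleep(seconds)
    return "compiled"


def run_action(pool, cancelled: threading.Event, seconds: float):
    """Return the worker result, or None as soon as the build is cancelled."""
    future = pool.submit(slow_worker_request, seconds)
    while not future.done():
        if cancelled.is_set():
            # Stop waiting. The worker keeps running in the background and
            # whatever it eventually returns is simply dropped.
            return None
        time.sleep(0.01)
    return future.result()


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        cancelled = threading.Event()
        threading.Timer(0.05, cancelled.set).start()  # simulated Ctrl+C
        print(run_action(pool, cancelled, seconds=0.5))
```

The worker keeps its caches and JIT state because it is never killed, but it also stays busy until the abandoned request completes, so this only makes the build return sooner; it does not free the worker sooner.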

It might be a good idea to rearchitect the worker protocol based on gRPC, so you can propagate cancellation to the workers. (I assume that gRPC supports cancellation). Couldn't you then interrupt the thread doing the compilation?

We can send SIGTERM even today - what's wrong with that?

On Wed, Nov 29, 2017 at 11:35 AM, Han-Wen Nienhuys <[email protected]> wrote:

you don't really have to kill the worker, actually. It's enough if you
don't wait for the worker result, but discard in a future.

It might be a good idea to rearchitect the worker protocol based on gRPC,
so you can propagate cancellation to the workers. (I assume that gRPC
supports cancellation). Couldn't you then interrupt the thread doing the
compilation?

@hanwen That would work and be a good solution until we have the better protocol that supports cancellations. Rearchitecting the worker protocol is also on my list of todos. :)

@ulfjack Sending SIGTERM kills the worker, which discards its cache and JIT optimizations. This might be acceptable, but I wasn't sure when I initially implemented that.

The worker can catch SIGTERM and abort the processing. SIGKILL can't be
caught.
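A sketch of what catching SIGTERM and aborting only the current request might look like in a worker, assuming the work can be split into steps that check a flag. The names are hypothetical and none of this is part of the existing worker protocol.

```python
import os
import signal

cancel_requested = False


def on_sigterm(signum, frame):
    """Mark the current request as cancelled; the process itself keeps running."""
    global cancel_requested
    cancel_requested = True


def handle_request(steps: int, do_step) -> str:
    """Run one request step by step, checking for cancellation between steps."""
    global cancel_requested
    cancel_requested = False
    for step in range(steps):
        if cancel_requested:
            return "cancelled"
        do_step(step)
    return "done"


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, on_sigterm)

    def step(i):
        if i == 2:
            os.kill(os.getpid(), signal.SIGTERM)  # stands in for Bazel's signal

    print(handle_request(1000, step))           # abandoned part-way through
    print(handle_request(3, lambda i: None))    # same process serves the next request
```

Because the process survives, caches and JIT state are kept, but this only works for workers that actually install such a handler.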

On Wed, Nov 29, 2017 at 11:48 AM, Philipp Wollermann <[email protected]> wrote:

@hanwen https://github.com/hanwen That would work and be a good
solution until we have the better protocol that supports cancellations.
Rearchitecting the worker protocol is also on my list of todos. :)

@ulfjack https://github.com/ulfjack Sending SIGTERM kills the worker,
which discards its cache and JIT optimizations. This might be acceptable,
but I wasn't sure when I initially implemented that.

@ulfjack Yes, but this is not in the current contract with any existing workers, so sending SIGTERM will result in killing of the workers if implemented now. Some concerns:

  • It's not clear if SIGTERM is the right signal for this purpose.
  • We don't have signals on Windows.
  • In certain languages / environments it's difficult to catch signals and install a signal handler. I could imagine this to be a problem in TypeScript and Java (without JNI) for example.

We can certainly change the contract and ask developers who wrote persistent workers to add a signal handler, but maybe the overall time and churn is better spent on coming up with a new protocol version that addresses the existing known issues. :)

In certain languages / environments it's difficult to implement a gRPC
server...

On Wed, Nov 29, 2017 at 4:24 PM, Philipp Wollermann <[email protected]> wrote:

@ulfjack https://github.com/ulfjack Yes, but this is not in the current
contract with any existing workers, so sending SIGTERM will result in
killing of the workers if implemented now. Some concerns:

  • It's not clear if SIGTERM is the right signal for this purpose.
  • We don't have signals on Windows.
  • In certain languages / environments it's difficult to catch signals
    and install a signal handler. I could imagine this to be a problem in
    TypeScript and Java (without JNI) for example.

We can certainly change the contract and ask developers who wrote
persistent workers to add a signal handler, but maybe the overall time and
churn is better spent on coming up with a new protocol version that
addresses the existing known issues. :)

@ulfjack I didn’t say that the new protocol would use gRPC and we agreed not even one week ago that improvements to the persistent worker would be put on hold, as you want me to focus on more urgent work first.

Yet I have the impression that you're playing devil's advocate for something here and reusing my own phrasing for some reason. What's your point?

We already have a proposal in our internal issue tracker (AFAIR) for a backwards compatible change that even adds support for executing actions in parallel. I can send you two a link tomorrow.

I'm not suggesting that we do any work on this in the near future, but it makes sense for us to agree on what a solution would look like, if someone else wants to contribute.

  • I think using gRPC is a no-go.
  • I think sending SIGTERM is a reasonable thing for us to do on Linux and MacOS - I realize it's a backwards-incompatible change in the protocol, so would need to be communicated to whoever is maintaining worker implementations. In the worst case, they can always use a tiny wrapper C binary that ignores SIGTERM.
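The wrapper idea can be sketched as below. The comment proposes a tiny C binary; this Python version only illustrates the same mechanism and relies on the POSIX rule that a signal set to be ignored stays ignored across exec. `exec_ignoring_sigterm` is a hypothetical name.

```python
import os
import signal
import sys


def exec_ignoring_sigterm(argv):
    """Replace this process with argv, with SIGTERM set to be ignored."""
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    # An ignored disposition survives exec, so the worker inherits it
    # unless it installs its own SIGTERM handler later.
    os.execvp(argv[0], argv)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        exec_ignoring_sigterm(sys.argv[1:])
    else:
        print("usage: wrapper.py WORKER [ARGS...]")
```

A worker launched this way keeps running when SIGTERM arrives, so it retains the old behavior at the cost of not being cancellable by that signal.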

What about my original suggestion to let the worker continue, but ignore its results? (Maybe not possible if you are writing the output files directly from the worker.)

Is there a 'bazel' command to cleanly stop the server that was started after running Bazel? Or is there a way to prevent the server from being started at all?

Ctrl-C should just stop all processes that were started by this command.

@boegel Have you tried bazel shutdown?
https://docs.bazel.build/versions/master/user-manual.html#shutdown

Pressing Ctrl + C three times should shut down the server immediately, but look out for the side effects mentioned in the comments above.

AFAIK, you cannot get around the server, since that is how Bazel starts up. There is a batch mode that shuts the server down immediately after a build, but see the comments in this issue to weigh the pros and cons: https://github.com/bazelbuild/bazel/issues/3051

Reclassifying this as a bug and moving to the local execution team. My reading of the history is that this bug is actually tracking the inability to cancel jobs scheduled on workers, which results in Ctrl+C not being immediately honored.

I would like to reiterate how useful it would be for Bazel to send SIGTERM (or SIGINT, or whichever signal Bazel itself received) to workers.

In my org's build system, we have a Bazel persistent worker that can, in some cases, take more than a minute to complete its work. We also have our own "watch" system set up, so that if source files change, the current build will be aborted, and a new one will be started. Among other things, we send SIGTERM to Bazel.

As a hack workaround, I'm experimenting with tweaking our own build system to manually send SIGTERM to our persistent workers. But it's messy. It would be easier and cleaner if Bazel sent a signal to workers.

@mmorearty FYI cmake is free of all these problems. I only came across bazel because some projects that come from google only come with bazel build files, and I needed to have them ported to BSDs. Otherwise, bazel is not a superior system, and cmake and meson are much better for the task.

@mmorearty The current behavior is by design, as the worker protocol does not allow running commands to be cancelled (which would be the correct way to fix this) and killing the workers with SIGTERM would result in a performance degradation on the next build, as all the workers have to be spawned again and have to go through JIT and populating their caches again.

That said, maybe our design considerations are wrong and we should change the behavior to kill workers on every interruption. That's entirely possible, but we would need to discuss this with the other worker authors before making a change.

If you would like to see this change, can you bring it up on bazel-discuss@? I'll Cc a few worker authors that I know of.

@yurivict Well, it's easy to be free of these "problems" if you just don't support the feature, like cmake. Feel free to disable workers (--strategy=Javac=sandboxed should be enough) in Bazel and your Bazel build will behave the same as a cmake build. We will also implement more robust process management via PROC_REAP_ACQUIRE and PROC_REAP_KILL on FreeBSD soon. On Linux we already use PID namespaces to ensure that child processes are reliably killed when Bazel exits, on Windows we use job objects for the same. (We're not aware of a similar mechanism for macOS unfortunately.)
