Rust: On Unix-like OSes, std::process::Command::spawn() should not use assert! from the child process after fork()

Created on 30 Jun 2020  路  29Comments  路  Source: rust-lang/rust

On Unix-like OSes, std::process::Command::spawn() uses an assert! to check if the pipe I/O succeeded. If that assert! fails, it will call panic!, which is not signal-safe. It's better to use if ....is_err() { std::intrinsics::abort() } there.
libc::abort() (std::process::abort()) shouldn't be used as a replacement, as, at least in glibc, it's not signal-safe[1][2], although all of the C standard, the C++ standard, the POSIX standard, and Linux man-pages say it should be.

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=stdlib/abort.c;h=df98782dd7ea6c1476184a365bd9f3954f481a18;hb=refs/heads/master#l54
[2] https://www.gnu.org/software/libc/manual/html_node/Aborting-a-Program.html#Aborting-a-Program


C-bug T-libs libs-impl

Most helpful comment

@tmiasko I have a fix for glibc. Locks are basically required to avoid seeing the unmasked signal state in the child. However, you have to take that lock during fork to ensure that you own and can unlock the lock in the child (avoids the deadlock). The regression test was not straight forward to write and on average it only reproduces once every 100-600 seconds on an x86_64 VM with 4vCPU. That's good enough though and I run the test for 20s to probabilistically try to catch if the defect comes back. On average if I ran the regression test just once per dev-test cycle it might take something like 32 years to show the defect :-( Thankfully more people than just the dev-test cycle will be using the API.

All 29 comments

It seems that nobody is aware of this... I know many libraries (such as the glibc above) don't care about signal safety in such situation. If you're not aware of this, feel free to close this issue.

Closing as there is no response.

cc @joshtriplett

This is a good catch. We certainly try to be signal-safe in such contexts, and failing to do so is a bug. Sorry to see that this didn't get a response. I'm reopening this, as it remains a bug.

I discovered that if libc::abort() is not signal-safe (in glibc), then rtabort!() is not signal-safe, so the signal handlers in the standard library that call rtabort!() is actually not signal-safe:
https://github.com/rust-lang/rust/blob/891e6fee572009ff2be4d4057fb33483610c36a7/src/libstd/sys/unix/stack_overflow.rs#L92-L106
Maybe I should open a new issue or report the abort(3) bug to glibc...

On Mon, Jul 20, 2020 at 01:32:22AM -0700, hyd-dev wrote:

I just realized if libc::abort() is not signal-safe (in glibc), then rtabort!() is not signal-safe, so the signal handlers in the standard library that call rtabort!() is actually not signal-safe:
https://github.com/rust-lang/rust/blob/891e6fee572009ff2be4d4057fb33483610c36a7/src/libstd/sys/unix/stack_overflow.rs#L92-L106
Maybe I should open a new issue or report the abort(3) bug to glibc.

signal-safety(7) lists abort(3) as async-signal-safe. If it isn't in
practice async-signal-safe in glibc, that's a bug and you should report
it.

On Mon, Jul 20, 2020 at 01:32:22AM -0700, hyd-dev wrote:
I just realized if libc::abort() is not signal-safe (in glibc), then rtabort!() is not signal-safe, so the signal handlers in the standard library that call rtabort!() is actually not signal-safe: https://github.com/rust-lang/rust/blob/891e6fee572009ff2be4d4057fb33483610c36a7/src/libstd/sys/unix/stack_overflow.rs#L92-L106 Maybe I should open a new issue or report the abort(3) bug to glibc.

signal-safety(7) lists abort(3) as async-signal-safe. If it isn't in practice async-signal-safe in glibc, that's a bug and you should report it.

Reported as https://sourceware.org/bugzilla/show_bug.cgi?id=26275.

So I think libc::abort() can be used, and if ... { something_like_rtabort!() } is suitable as a replacement. WDYT?

So I think libc::abort() can be used, and if ... { something_like_rtabort!() } is suitable as a replacement. WDYT?

@joshtriplett Should this be fixed like that?

Given the above research, I don't think libc::abort() is safe to use either.

Using std::intrinsics::abort() seems like a potentially reasonable approach here. (std::process::abort() just calls libc::abort().)

I would prefer just using rtabort! since that is precisely what it was designed for.

That call chain is rtabort! -> sys_common::util::abort() -> sys::abort_internal(), which on unix calls libc::abort().

https://github.com/rust-lang/rust/blob/e5e33ebd2ba12a78dbf6e2d5f154d5f71f28576c/library/std/src/sys/unix/mod.rs#L159-L168

The note about flushing files isn't something we can rely on:

https://man7.org/linux/man-pages/man3/abort.3.html

   Up until glibc 2.26, if the abort() function caused process
   termination, all open streams were closed and flushed (as with
   fclose(3)).  However, in some cases this could result in deadlocks
   and data corruption.  Therefore, starting with glibc 2.27, abort()
   terminates the process without flushing streams.  POSIX.1 permits
   either possible behavior, saying that abort() "may include an attempt
   to effect fclose() on all open streams".

In upstream glibc the goal is for abort() to be AS-safe as required by the standard and it should be today (after glibc 2.27). In fact doing anything superfluous during the abort() generally constitutes a security hazard because the process can be in an unknown state when corruption is detected and the abort() called. For the sake of reducing restart latency and avoiding executing any compromised parts of the process we shut down as quickly as possible. We strongly recommend out-of-process backtracing to avoid needing to run any of that code in the process with bad or unknown state.

In upstream glibc the goal is for abort() to be AS-safe as required by the standard and it should be today (after glibc 2.27).

It's not. It tries to hold a lock.

Example demonstrating a deadlock. After a dozen of so runs leaves deadlocked children behind:

#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

void *lazy_abort(void *arg) {
  struct timespec req = {.tv_sec = 0, .tv_nsec = 1000000 };
  nanosleep(&req, NULL);
  abort();
  return NULL;
}

int main() {
  pthread_t thread;
  signal(SIGABRT, SIG_IGN);
  int r = pthread_create(&thread, NULL, lazy_abort, NULL);
  assert(r == 0);

  while (1) {
    pid_t pid = fork();
    assert(pid != -1);
    if (pid == 0) {
      abort();
    }
  }
}

At least the conditions for deadlock are quite unlikely, since both a child process and a thread in the parent process must abort.

The musl implementation looks correct (it also uses a lock, which by itself is only a weak indicator of a potential issue).

@tmiasko You may want to post that on the glibc bug report here: https://sourceware.org/bugzilla/show_bug.cgi?id=26275

@tmiasko I have a fix for glibc. Locks are basically required to avoid seeing the unmasked signal state in the child. However, you have to take that lock during fork to ensure that you own and can unlock the lock in the child (avoids the deadlock). The regression test was not straight forward to write and on average it only reproduces once every 100-600 seconds on an x86_64 VM with 4vCPU. That's good enough though and I run the test for 20s to probabilistically try to catch if the defect comes back. On average if I ran the regression test just once per dev-test cycle it might take something like 32 years to show the defect :-( Thankfully more people than just the dev-test cycle will be using the API.

Unfortunately, avoiding abort() may still be required after the patch lands, to wait for it to be released and included in GNU/Linux distributions.

I think we can make sys::abort_internal() call intrinsics::abort() when using glibc and then use rtabort! in spawn().

Please don't use intrinsics::abort. If you really want then raise SIGABRT instead.

It is: https://www.man7.org/linux/man-pages/man7/signal-safety.7.html

Although you should probably use pthread_kill, pthread_sigmask and sigaction to properly implement an abort replacement (all of which are also signal-safe).

It is: https://www.man7.org/linux/man-pages/man7/signal-safety.7.html

Although you should probably use pthread_kill, pthread_sigmask and sigaction to properly implement an abort replacement (all of which are also signal-safe).

Sorry. I confused. You're correct.

Although you should probably use pthread_kill, pthread_sigmask and sigaction to properly implement an abort replacement (all of which are also signal-safe).

@Amanieu https://man7.org/linux/man-pages/man3/raise.3.html#DESCRIPTION says raise(sig) is equivalent to pthread_kill(pthread_self(), sig). I think pthread_sigmask(), sigaction() and just raise() can be used to implement the replacement.

I can submit a PR to do that.
@rustbot claim

IMO we should just leave it as glibc's acknowledged bug, unless folks are actively hitting this problem. While we do hope to be technically "pure", the actual abort race seems like a very artificial situation that I doubt is a problem in practice.

IMO we should just leave it as glibc's acknowledged bug, unless folks are actively hitting this problem.

@cuviper @Amanieu I've found that if I have to avoid intrinsics::abort() and use libc::raise(), I almost need to reimplement libc::abort() and port its tests to std. Maybe it's really better to keep using libc::abort() and just replace the assert! with rtassert!.

While we do hope to be technically "pure", the actual abort race seems like a very artificial situation that I doubt is a problem in practice.

Actually that's what my "don't care about signal safety in such situation" mean. But, again, feel free to ignore the race if it's really not a problem in practice. Also, thanks @codonell for fixing the race in glibc. The race will eventually get fixed (by updating glibc).

I can still submit a PR to just change the assert! to rtassert!. That assert! unsafety shouldn't be ignored.

Was this page helpful?
0 / 5 - 0 ratings