Nix: Use of PR_SET_PDEATHSIG results in Broken pipe errors when using SSH substituters

Created on 2 Sep 2018  路  18Comments  路  Source: NixOS/nix

Problem

Nix 2.0.4.

Sometimes (on my desktop, often, namely in ~66% of the cases), nix (on Ubuntu 16.04) fails when fetching stuff from my binary cache (NixOS 18.03).

Message:

copying path '/nix/store/4nny4bm8aflicjalfnrfbkfdd5b54nv3-conduit-1.2.13.1.tar.gz' from 'ssh://[email protected]'...
writing to file: Broken pipe

It's nondetemrinistic, so trying it multiple times usually helps, but it is still very annoying.

Investigation

strace'd the failure, saw:

6837  write(10<pipe:[7624621]>, "\3\0\0\0\0\0\0\0C\0\0\0\0\0\0\0/nix/store/4nny4bm8aflicjalfnrfbkfdd5b54nv3-conduit-1.2.13.1.tar.gz\0\0\0\0\0", 88) = -1 EPIPE (Broken pipe)

Searched for the pipe number 7624621 further up, found (starting from the first occurrence):

6830  pipe2([9<pipe:[7624621]>, 10<pipe:[7624621]>], O_CLOEXEC) = 0
6830  pipe2([11<pipe:[7624622]>, 12<pipe:[7624622]>], O_CLOEXEC) = 0
6830  vfork( <unfinished ...>
6833  prctl(PR_SET_PDEATHSIG, SIGKILL)  = 0
6833  rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
6833  close(10<pipe:[7624621]>)         = 0
6833  close(11<pipe:[7624622]>)         = 0
6833  dup2(9<pipe:[7624621]>, 0</dev/pts/12>) = 0<pipe:[7624621]>

Noticed the prctl(PR_SET_PDEATHSIG, SIGKILL) = 0 in there.

Looked up https://stackoverflow.com/questions/10761197/prctlpr-set-pdeathsig-signal-is-called-on-parent-thread-exit-not-parent-proc for details.

Grepped for PR_SET_PDEATHSIG in nix's code, found:

https://github.com/NixOS/nix/blob/54996b51fba46e8f287b1ac6a58854e62b476985/src/libutil/util.cc#L926-L927

Commented out the lines:

            if (options.dieWithParent && prctl(PR_SET_PDEATHSIG, SIGKILL) == -1)
                throw SysError("setting death signal");

And the problem disappeared.

bug

All 18 comments

The change that introduced PR_SET_PDEATHSIG is already 4 years old: https://github.com/NixOS/nix/commit/809ca33806d75eeabb9c668b124762fb6462e5bc

But I didn't have this problem with nix before I upgraded to nix 2.0.

So I suspect something around nix 2.0 changed the logic that eventually results in the relevant thread disappear in a racy fashion.

For the record, I had first tried to strace the process, which showed that the ssh invocation is SIGKILLed:

% nix-store --delete /nix/store/4nny4bm8aflicjalfnrfbkfdd5b54nv3-conduit-1.2.13.1.tar.gz && strace -s 3000 -fy -e execve $HOME/opt/nix-master/bin/nix-shell -j4 -I /home/niklas/myproject/nix-channel --argstr gitSha 2dc94806e010cf17be2e8a74a39aabe3334e3481 --arg gitDirty true --argstr cpuMarch haswell --option extra-substituters 'ssh://[email protected]' packages.nix -A haskellPackages.conduit --run '' 2>&1 | grep -v SIGCHLD | grep -v 'strace: Process' | grep -v '+++ exited with' | grep -v ENOENT
finding garbage collector roots...
removing stale temporary roots file '/nix/var/nix/temproots/19223'
deleting '/nix/store/4nny4bm8aflicjalfnrfbkfdd5b54nv3-conduit-1.2.13.1.tar.gz'
deleting '/nix/store/trash'
deleting unused links...
note: currently hard linking saves -0.00 MiB
1 store paths deleted, 0.05 MiB freed
execve("/home/niklas/opt/nix-master/bin/nix-shell", ["/home/niklas/opt/nix-master/bin/nix-shell", "-j4", "-I", "/home/niklas/myproject/nix-channel", "--argstr", "gitSha", "2dc94806e010cf17be2e8a74a39aabe3334e3481", "--arg", "gitDirty", "true", "--argstr", "cpuMarch", "haswell", "--option", "extra-substituters", "ssh://[email protected]", "packages.nix", "-A", "haskellPackages.conduit", "--run", ""], [/* 47 vars */]) = 0
[pid 20862] execve("/usr/bin/ssh", ["ssh", "[email protected]", "-x", "-a", "nix-store --serve --write"], [/* 48 vars */]) = 0
Authenticated to mymachine.example.com ([1.2.3.4]:22).
these paths will be fetched (0.00 MiB download, 0.05 MiB unpacked):
  /nix/store/4nny4bm8aflicjalfnrfbkfdd5b54nv3-conduit-1.2.13.1.tar.gz
[pid 20862] +++ killed by SIGKILL +++
copying path '/nix/store/4nny4bm8aflicjalfnrfbkfdd5b54nv3-conduit-1.2.13.1.tar.gz' from 'ssh://[email protected]'...
writing to file: Broken pipe
copying path '/nix/store/4nny4bm8aflicjalfnrfbkfdd5b54nv3-conduit-1.2.13.1.tar.gz' from 'https://cache.nixos.org'...
[pid 20857] execve("/nix/store/496rv6cn7n317jfbrid3662f4l7jf0is-bash-4.4-p12/bin/bash", ["bash", "/tmp/nix-shell-20857-0/rc"], [/* 92 vars */] <unfinished ...>
<... execve resumed> )                  = 0
[pid 20870] execve("/bin/rm", ["rm", "-rf", "/tmp/nix-shell-20857-0"], [/* 92 vars */]) = 0
[pid 21306] execve("/nix/store/920k63py2w97xpfyb5ps9l6wbidvzfjz-coreutils-8.29/bin/tr", ["tr", "[:lower:]", "[:upper:]"], [/* 104 vars */]) = 0
[... similar lines ...]
[pid 21339] execve("/nix/store/920k63py2w97xpfyb5ps9l6wbidvzfjz-coreutils-8.29/bin/tr", ["tr", "[:lower:]", "[:upper:]"], [/* 115 vars */]) = 0

[pid 20862] +++ killed by SIGKILL +++

Details:

[pid 27838] execve("/usr/bin/ssh", ["ssh", "[email protected]", "-x", "-a", "nix-store --serve --write"], [/* 48 vars */]) = 0
Authenticated to mymachine.example.com ([1.2.3.4]:22).
[pid 27837] +++ exited with 0 +++
[pid 27835] +++ exited with 0 +++
these paths will be fetched (0.00 MiB download, 0.05 MiB unpacked):
[pid 27838] +++ killed by SIGKILL +++
[pid 27836] +++ exited with 0 +++
[pid 27831] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=27838, si_uid=1000, si_status=SIGKILL, si_utime=2, si_stime=1} ---

From https://stackoverflow.com/questions/26285133/who-sends-a-sigkill-to-my-process-mysteriously-on-ubuntu-server

% sudo apt-get install auditd audispd-plugins
% sudo auditctl -a exit,always -F arch=b64 -S kill -F a1=9 -k kill_signals
% sudo auditctl -l
-a always,exit -F arch=b64 -S kill -F a1=0x9 -F key=kill_signals

When it happens, I see these entries appear:

% sudo ausearch -sc kill                                                                             
...
time->Sun Sep  2 22:49:18 2018
type=PROCTITLE msg=audit(1535921358.481:2704): proctitle=737472616365002D730033303030002D6679002D6500657865637665002F686F6D652F6E696B6C61732F6F70742F6E69782D6D61737465722F62696E2F6E69782D7368656C6C002D6A34002D49002F686F6D652F6E696B6C61732F7372632F62656E61636F2D7373642F62656E61636F2F6E69782D6368616E6E656C002D2D61
type=OBJ_PID msg=audit(1535921358.481:2704): opid=30610 oauid=1000 ouid=1000 oses=5650 ocomm="strace"
type=SYSCALL msg=audit(1535921358.481:2704): arch=c000003e syscall=62 success=yes exit=0 a0=7792 a1=9 a2=0 a3=0 items=0 ppid=24535 pid=30607 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts12 ses=5650 comm="strace" exe="/usr/bin/strace" key="kill_signals"

However, they don't appear without strace, and pid=30607 is strace's PID, so it appeared as if strace itself was sending this signal!

And when I don't use strace, no kill appears in auditd.


So it looks like the effects of PR_SET_PDEATHSIG are visible with strace (and appear weirdly, as if strace was the signal sender), but aren't visible with perf or auditd.

In Nix < 2.0 we didn't call SSH from C++ but only from Perl code (like nix-copy-closure) which didn't set PR_SET_PDEATHSIG.

I should also mention: This makes SSH substitutors almost unusable for me, as e.g. for the tested package it happens in 66% of the cases for me.

Disclaimer: I ran across this issue while searching for a related issue. I've never used Nix before.

PR_SET_DEATHSIG will kill the child process if the parent dies -- but the parent is the thread, not the process. So if the process was launched from some thread that died, that could cause the issue that you're seeing (which bit me in my project earlier today).

Warning: the "parent" in this case is considered to be the thread that created this process. In other words, the signal will be sent when that thread terminates (via, for example, pthread_exit(3)), rather than after all of the threads in the parent process terminate.

Reference: http://man7.org/linux/man-pages/man2/prctl.2.html (and stackoverflow, but I've since lost that link)

Disclaimer: I ran across this issue while searching for a related issue. I've never used Nix before.

Yey, I actually posted as much information as possible in the hope that people would find this in related issues. Happy it worked. Didn't expect it to work so soon, looks like PR_SET_DEATHSIG is popular these days ;)

and stackoverflow, but I've since lost that link

@virtuald It might be the one I linked in the issue description above.

I am also affected by this issue and have found that adding -vv can change the timing enough to avoid a consistent Broken pipe.

Just adding that this is an issue preventing my company from using ssh to serve our cache, so it would be good to see some kind of resolution on this. For now we are going to try to write via ssh and read via http wih nix-serve

nh2 would love to! Probably not going to happen soon, but will keep it in mind.

Any progress on this?

It looks like using PR_SET_PDEATHSIG is the right thing to do, which I think means making sure the spawning thread doesn't end until the subprocess has exited. @nh2 have you looked in to why that is happening?

In my case it only fails when using nix-shell with an ssh substituter. In the case of nix-shell the failure rate is even at ~90%. However I don't see any problems with nix-env -i.

@ivan @Camsbury Could you try my workaround?

nh2@4913e64

For me the workaround works well.

I looked a bit into the issue. As @nh2 already explained, the SSH client is killed because the spawning thread exits. This is caused - as seen above - by PR_SET_PDEATHSIG. The usage of PR_SET_PDEATHSIG makes sense, as we do not want an orphaned SSH client in the case we kill the nix-daemon.

However the problem is, that the SSH client is started lazily. When starting to query missing paths, a bunch of worker threads are created. Each missing path is enqueued and then assigned to a worker thread. The first worker thread, that requires a connection to the SSH binary cache creates the SSH connection. This SSH connection is supposed to stay alive until all missing paths have been downloaded.

The worker threads are only kept alive as long as there is something to be processed (with the exception of the main thread, which is a worker thread as well). If all tasks have been performed for the time being, the worker threads are shutdown. A shutdown of all worker threads can occur at least twice during one execution. That means, that the thread that has started the SSH client, might get terminated if it was not the main thread that started the SSH client. After the first shutdown of the worker threads has been done, the worker threads are restarted to query some more paths. The newly started worker threads expect that the SSH connection is still alive and therefore fail when the SSH connection had been terminated before.

The issue is non-deterministic as it is possible, that the main thread is the first worker thread which requires the SSH connection. If this happens, the SSH client will not be shutdown as the main thread never gets terminated.

@chpatrick made a backport of this fix for nix-2.3.2 which is on NixOS 20.03: https://github.com/chpatrick/nix/commit/ebaba525fc879f9ee3461b07ad03cd2111d27479

@edolstra Could a new nix point release be made includes th fix for this issue?

We had lots of trouble using nix in production without my fix https://github.com/NixOS/nix/issues/2395#issuecomment-506361263 or the backport of the real fix (https://github.com/chpatrick/nix/commit/ebaba525fc879f9ee3461b07ad03cd2111d27479) because you get broken pipe in 100% of the cases for ssh substituters.

Nix 2.3.4 has it, thanks @edolstra!

Was this page helpful?
0 / 5 - 0 ratings