Fish-shell: Killing fish reparents some child processes into CPU hogs

Created on 13 Dec 2016  ·  82 Comments  ·  Source: fish-shell/fish-shell

  • [x] Have you checked if problem occurs with fish 2.4.0?
  • [x] Tried fish without third-party customizations (check sh -c 'env HOME=$(mktemp -d) fish')?

fish version installed (fish --version): 2.4.0

OS/terminal used: Arch Linux / URxvt 9.22

I think issue #3155 was closed a bit early, and since nobody has responded to my late comment I'm opening a new issue to sum up and clarify the matter at hand.

There seems to be a race condition during the termination of fish when it is running a full-screen terminal application (e.g. one built on ncurses).

The issue happens randomly, yet very often when ranger and some fzf widgets are involved.
It seems to happen less often with cmus, ncdu and other ncurses applications.

The OP of #3155 mentioned that man also triggers the issue. I cannot reproduce that; it might depend on the definition of man (is it the original fish function?) and the pager in use.

Reproduction steps

I have a ranger-cd function for syncing ranger and fish:

function ranger-cd -d 'Run ranger and sync folders with shell'
    [ (count $argv) -eq 0 ]; and set argv .
    set -l tempfile (mktemp)
    command ranger --choosedir="$tempfile" $argv
    if grep -q . "$tempfile" ^ /dev/null
        set -l dest (cat -- "$tempfile")
        [ "$dest" != "$PWD" ]; and cd $dest; and commandline -f repaint
    end
    rm -f -- "$tempfile"
end

Running ranger-cd and closing the terminal will (almost always) reparent ranger to PID1 and make it a CPU hog.

I noticed that if I make the function shorter:

function ranger-cd -d 'Run ranger and sync folders with shell'
    [ (count $argv) -eq 0 ]; and set argv .
    set -l tempfile (mktemp)
    command ranger --choosedir="$tempfile" $argv
end

This almost never happens. Adding a few echoes is enough to trigger the issue again:

function ranger-cd -d 'Run ranger and sync folders with shell'
    [ (count $argv) -eq 0 ]; and set argv .
    set -l tempfile (mktemp)
    command ranger --choosedir="$tempfile" $argv
    echo
    echo
    echo
    # ... I suppose you should add more echo if your machine is fast.
end

Same thing with fzf's file-widget.
The signal state of a hogging fzf:

> ps -ef | grep 32718
ambrevar   479   363  0 13:18 pts/6    00:00:00 grep --color=auto 32718
ambrevar 32718 32714 93 13:17 ?        00:00:33 /usr/bin/fzf -m --bind=ctrl-j:execute-multi(rifle {}) --preview=preview {}

>  grep Sig /proc/32718/task/32718/status
SigQ:   0/15135
SigPnd: 0000000000000000
SigBlk: fffffffe73fa3a25
SigIgn: 0000000000000000
SigCgt: ffffffffffc9feff

> grep Sig /proc/32714/task/32714/status
SigQ:   0/15135
SigPnd: 0000000000000000
SigBlk: 0000000000010000
SigIgn: 0000000000000004
SigCgt: 0000000000010002

The strace of fzf is huge and I'm not sure what to look for.

EDIT: Happens outside functions too

bug

All 82 comments

Thanks for the great bug report! Conceivably related to #422

It took a few attempts but I was able to reproduce this on Ubuntu 16.04. The problem is that ranger is ignoring the EOF indication when it reads from the tty. From the strace output it goes into a tight loop stat()'ing various files and directories after each attempt to read a character from the tty fails:

read(0, "", 1)                          = 0

So that's a bug in ranger but there is something else going on.

Note that in my case the ranger process is still parented to the fish process and the fish process was reparented to pid 1. The fish process is waiting for the ranger process to exit. Manually sending the ranger process a SIGHUP turns it into a zombie. It should have received a SIGHUP when I killed the ssh daemon. So we need to investigate whether there is a bug in how the controlling terminal process group is being managed. Something else that is odd is the parent fish process isn't running and reaping its now dead child process. It's blocked on a futex.

Before doing the read(0,...) that returns zero ranger polls fd 0. And the response has revents = POLLIN|POLLERR|POLLHUP which pretty clearly tells it that the tty is no longer valid. Ranger is badly broken with respect to how it interacts with the tty. But that bug wouldn't matter if fish didn't also have a bug.

The strace output shows that the SIGHUP is being delivered to fish, not the ranger child process. Fish then attempts to call tcsetpgrp() (or ioctl(0, TIOCSPGRP)) which fails with errno ENOTTY, "Inappropriate ioctl for device". The attempt to write that error also fails because the tty has gone away. It looks like it tries to call tcsetpgrp() twice then goes to sleep on a futex.

What is odd is it successfully calls tcsetpgrp() using the child's PID before calling execve("/usr/bin/ranger",...). A little bit later, while the ranger process is still starting up, it does the tcsetpgrp() twice in the parent fish shell with the correct PID for the child ranger process. So it isn't obvious why the SIGHUP was sent to the parent fish process rather than the child ranger process. For that matter, even if the SIGHUP should go to the parent fish process, why doesn't it propagate the signal to its children?

Also, if I run the ranger program from a shell prompt rather than through an intermediate fish function, fish correctly propagates the SIGHUP to the ranger process, thus killing it. This suggests there is a mistake in how external commands are managed when invoked from a function. Possibly related to issue #2980.

@Ambrevar: Gitter told me you edited this issue after my previous comments but I can't tell what you changed. Sadly Github doesn't show those edits. Did you just fix spelling/grammar errors or did you make a substantive change to your problem statement?

Just typos, all good!

Amazing investigative work krader! Are you able to get a backtrace from fish when it's waiting in this futex?

Okay, when fish is hung on a futex the stack backtrace looks like the following. It's not obvious why that would hang in the main thread. This implies that we did a malloc or free on the child side of the fork and still hold the malloc arena lock.

#0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007f9f3168fd4d in _int_free (av=0x7f9f319d3b20 <main_arena>, p=0x1fc32a0, have_lock=0) at malloc.c:3960
#2  0x00007f9f3169398c in __GI___libc_free (mem=<optimized out>) at malloc.c:2966
#3  0x0000000000451abe in __gnu_cxx::new_allocator<wchar_t>::deallocate (this=<optimized out>, __p=<optimized out>) at /usr/include/c++/5/ext/new_allocator.h:110
#4  std::allocator_traits<std::allocator<wchar_t> >::deallocate (__a=..., __n=<optimized out>, __p=<optimized out>) at /usr/include/c++/5/bits/alloc_traits.h:517
#5  std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >::_M_destroy (__size=<optimized out>, this=0x7ffcf98a6460)
    at /usr/include/c++/5/bits/basic_string.h:185
#6  std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >::_M_dispose (this=0x7ffcf98a6460) at /usr/include/c++/5/bits/basic_string.h:180
#7  std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >::~basic_string (this=0x7ffcf98a6460, __in_chrg=<optimized out>)
    at /usr/include/c++/5/bits/basic_string.h:543
#8  debug (level=level@entry=1, msg=<optimized out>) at src/common.cpp:566
#9  0x00000000004b9469 in term_steal () at src/reader.cpp:345
#10 reader_run_command (parser=..., cmd=L"ranger-cd") at src/reader.cpp:1959
#11 0x00000000004bd6f4 in read_i () at src/reader.cpp:2313
#12 reader_read (fd=fd@entry=0, io=...) at src/reader.cpp:3381
#13 0x000000000042c508 in main (argc=1, argv=<optimized out>) at src/fish.cpp:491

Changing the debug() call in term_steal() to debug_safe() "fixes" the problem. Fish correctly sends a SIGHUP to the ranger child process then exits. I say fixes in scare quotes because it shouldn't be necessary to use the safe variant there. The question is how to figure out why the malloc arena lock is held causing the string destructor to hang in free() (or deallocate() as the case may be).

Nice! Can you check if there are any other threads active and what their backtraces are?

This could be explained if this were a forked child. However this backtrace does not look like it: fish's children should always be under exec_job, and so should never have that backtrace. (My understanding is that parents shouldn't get deadlocked even if the children inherit locked mutexes.)

Some googling turned up a few potential futex_wait hangs: a long shot in the kernel and a slightly more likely one in glibc. These are admittedly long shots, but the glibc one feels more likely to me.

Can you check to see whether you have glibc version < 2.12 ? ldd --version will output this. Also, Linux kernel version < 3.18 (uname -r)?

This is Ubuntu 16.04 so everything is pretty recent: ldd (Ubuntu GLIBC 2.23-0ubuntu4) 2.23 and 4.4.0-47-generic. Running info threads shows only the main thread.

Another interesting wrinkle. I added some brute force write(2,...) statements to try to confirm where it is deadlocking. Then I logged in and ran exec fish -l -i -d5 >/tmp/fish.dbg and was unable to reproduce the hang. However, with the same debugging statements in place but without restarting the shell, it still hangs. So the exec fish is changing the situation in some relevant manner. Even more interesting: if I simply add the following to main() I can no longer reproduce the hang:

close(2);
open("/tmp/fish.dbg", O_WRONLY);

After making that change simply logging in with ssh like I was doing before and running the ranger-cd function now properly exits both processes when I summarily kill the ssh connection. So there is something different about logging to a file versus the invalid tty. Very curious.

Great find! Now I strongly believe you're seeing this unfixed wchar buffer overflow that was found in #3401. The conditions are just right: outputting to a FILE* with a file descriptor that won't accept any more data. This also explains why your debug_safe change fixes it: that goes directly to the fd instead of through stdio, so it avoids stdio's buffer overflow.

This buffer overflow typically manifests as stomping a function pointer used by malloc, which can lead either to a crash (as we found) or hang (as glibc found).

Bingo! I think we have a match. Now to ponder how best to work around that bug.

I should just submit a patch to glibc.

Sure but even if you did so tonight we're still going to be running fish on systems with a broken glibc for several years. 😢

For the record, @Ambrevar, I would still recommend you file a bug report against the ranger project. A well-behaved program like vim does the right thing when the tty goes away: it exits rather than spins.

True. My memory is that in this case, fwprintf will return an error. One possibility is to catch that error, and then set a global variable to suppress further output. I think this will work because the error is returned before the buffer gets full. Still it's annoying because we have to look at all uses of stdio.

We definitely do not want to vet all uses of stdio. If for no other reason than we could trigger the bug before we knew there was a problem. Better to check a few strategic places, such as where we call tcsetpgrp(), for ENOTTY errors and reopen stdout and stderr on /dev/null. I considered reopening them on a path like /tmp/fish.$PID but deemed the diagnostic value too small relative to the problem of having people complain about such files magically appearing.

I'll report the bug to ranger and fzf. You believe the bug to be that they ignore EOF when reading from TTY, right?

I'm not sure I've understood the conclusion reached in the above discussion: is something to be fixed in fish at all?
I tend to think there should be, because bash/zsh do not suffer from this issue (same glibc, same kernel).

@Ambrevar: Fish does not have a bug per se. The broken behavior is due to a bug in the GNU libc (glibc) library that fish uses on your system (and the Ubuntu 16.04 system I'm using to reproduce this issue). That bash/zsh do not exhibit the broken behavior is probably because they have a different manner of handling such errors. Possibly because those projects also noticed the broken glibc behavior and have already worked around that bug.

There is definitely a bug in ranger in that it does not correctly handle fd 0 being attached to a no longer valid tty. I did not experiment with fzf but would not be surprised if it had a similar bug. As I said previously well behaved programs like vim do the right thing and exit rather than spin when stdin is no longer a valid tty.

Just tested the latest commit: sadly, this does not fix the issue :(

My change fixes the problem for me on Ubuntu. If it doesn't fix it for you we're going to require more information from you. Starting with attaching to the fish parent process of the spinning ranger/fzf process with gdb and running the bt command. Also, in this type of situation it is all too common to think you are using a fixed version of the program when in fact you are not. So in this case reporting the output of echo $version in the shell that you are going to run ranger is a good idea.

I'm using the right version:

> echo $version 
2.4.0-232-g396bf12

I'll backtrace fzf and see what it reports.

It is possible that fzf has a different failure mode. Are you able to reproduce this problem using ranger with my fix?

Another thing: how do you make sure the bug is still not there considering it occurs randomly?
It happens to me that fzf terminates properly 15 times in a row.

Both ranger-cd and fzf still suffer from the issue.

It does not happen randomly for me using your ranger-cd function. I can reliably reproduce the failure 100% of the time. Let us ignore fzf for the moment. If you can reproduce this problem using your ranger-cd function with my fix please do the following.

First, start a shell and echo %self. Make note of the PID it reports. Then reproduce the problem by running ranger-cd (with or without arguments should not matter). In a different shell run gdb -p $PID (replacing $PID with the pid reported by the echo %self) and type bt. You might need to preface that gdb invocation with sudo in order to attach to the fish process so you can debug it. What is the bt output?

Hmmm, that's not gonna work, since fish gets terminated: it receives SIGHUP, so if I run GDB after killing the terminal, it cannot find the PID. If I attach GDB to the process beforehand, it shows

...
[New Thread 0x7f38d99f7700 (LWP 12417)]
[Thread 0x7f38d99f7700 (LWP 12417) exited]

Thread 1 "fish" received signal SIGHUP, Hangup.
0x00007f38db5d3afb in waitpid () from /usr/lib/libpthread.so.0

Okay, let's trace what is happening. You'll need three terminals. Do the following:

1) term1: echo %self
2) term2: strace -f -ttt -o /tmp/fish_strace.txt -p $pid where $pid is the output of step 1.
3) term1: ranger-cd
4) term3: ps waux | grep ranger find the pid then grep Sig /proc/$pid/task/$pid/status
5) term1: close the terminal or whatever you've been doing to trigger the problem.
6) term3: if strace hasn't already exited kill $pid where $pid is from step 4.
7) attach /tmp/fish_strace.txt to this issue.

Using your latest commit I haven't been able to reproduce the issue with my original version of ranger-cd, only with the echo version. It also seems like the frequency of the bug has dramatically decreased, which makes it much harder to work on.
fzf file-widget, however, seems to have remained the same, maybe worse.

Either way, when I attach strace to the process, the bug does not occur :(

The grep Sig /proc/32718/task/32718/status you included in your original comment shows that fzf is blocking a large number of signals including SIGHUP. So it's not surprising it isn't dying although how strace would affect that is not at all obvious. Please attach a trace where the problem does not occur since it might contain a clue. If it doesn't I'm out of ideas since I'm not going to bother to install and learn how to use fzf. We already know that both programs are broken in that they don't correctly handle the tty being closed irrespective of whether or not there is a still unidentified bug in fish.

We fixed one bug that caused this behavior and no one else can now reproduce the problem other than possibly @Ambrevar. And that involves a process, fzf, that is definitely behaving in an odd manner (e.g., blocking so many signals and not correctly handling the tty becoming invalid). So I'm going to close this in one week if no new data is provided by that deadline. Obviously if someone is still able to induce spinning processes when the tty is closed we'll be happy to take another look whether or not this issue has been closed.

Sorry for the late update.
I had no more success reproducing the bug while strace was attached to fish running fzf.
Here is what I got running fzf successfully.
Two possible outputs for the signal query:

> set pid (pgrep fzf); grep Sig /proc/$pid/task/$pid/status
SigQ:   0/15135
SigPnd: 0000000000000000
SigBlk: fffffffe73fa3a25
SigIgn: 0000000000000000
SigCgt: ffffffffffc9feff
> set pid (pgrep fzf); grep Sig /proc/$pid/task/$pid/status
SigQ:   0/15135
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: ffffffffffc9feff

And here is one strace: https://gist.github.com/Ambrevar/cbd257aafe556bfa666a68ab5eda8d76

Have you tried with this function?

function ranger-cd
    [ (count $argv) -eq 0 ]; and set argv .
    set -l tempfile (mktemp)
    command ranger --choosedir="$tempfile" $argv
    echo
    echo
    echo
    # Might need more 'echo' here.
end

It still fails for me, so this is not specific to fzf.

Regarding fzf, it is easy to set up and has a flat learning curve: just source the key-binding file and press CTRL-T.
As a side note / second argument: you should give fzf a try, it propels the fish experience to a whole new level of awesomeness :)

I cannot reproduce with your ranger-cd function that includes echo commands. No matter how many echoes are in the function body. I've tried with zero and 5000 and many values in between. Obviously I am using a fish with my prior fix. It's still trivial to reproduce the problem without that fix.

As for fzf's signal management, it is decidedly odd that some grep Sig /proc/$pid/task/$pid/status snapshots show it blocking so many signals. It is also catching every possible signal, which is unusual but perhaps typical for a Go program. You can see this by doing grep '^9809 ' in the strace you provided. You can also see, at the end of the trace, that it receives the SIGHUP from the fish process, sets SIGHUP handling back to the default behavior, then sends itself a SIGHUP which terminates it.

The only way strace could be a factor is by altering the timing of the program being traced. Which also suggests a bug; specifically, a race condition.

My first attempt to install fzf told me it needed Ruby and the "gem" package, so I gave up. I took another look and realized that the Linux system I've been using to debug this did not have Go installed when I thought it did. So I installed Go, and upon reinstalling, fzf built from Go source without requiring Ruby. I logged in with ssh, typed a few characters, pressed [ctrl-T], and found myself running fzf. I then did kill -9 $ssh_pid on the ssh client. The fzf process died as expected. I made multiple attempts to induce a spinning fzf without success.

I then reverted to the fish source prior to my fix. I still can't get fzf to spin.

@Ambrevar: I'm wondering if you're somehow putting a fzf program into the background. The reason I wonder that is up until commit 8d27f81a that I made on Dec 17 fish would leave background jobs running when it exits. Can you try using a fish built from the current git HEAD, @Ambrevar?

@Ambrevar: If I gave you a patch would you be able to build a custom fish and collect some data? While working on issue #2007 I noticed that fish blocks and unblocks all signals multiple times in the code path that launches jobs. It should be leaving all signals unblocked when fzf is launched. However, I'm wondering if that isn't true on your system, at least some of the time. That /proc/pid/status shows so many signals blocked some of the time is, as I've mentioned before, really unusual and an odd thing for a program like fzf to do. However, signals that are masked (blocked) at the time of the execve() call are left in that state when the new process starts running. So if fish is calling its signal_block() function but not signal_unblock() before the execve that launches fzf, that would explain the behavior you're seeing.

Agreed that fzf keeping the old Ruby source is weird. But yeah, it has been a Go program for a couple of years now.

I tried with the latest fish commit:

> echo $version
2.4.0-277-g4cd34816

Did not help. I cannot see how I could inadvertently put fzf into the background. Wrong binding? I'm using Ctrl-Alt-t, but I also tried with Ctrl-t.

I can build a custom fish, no problem, send me your patch.

The ranger spinning happened again. Maybe I have just been lucky so far.
I found other programs that trigger the same issue: cmus, ncdu...
They all share a common dependency: ncurses. I'm using the latest Arch Linux update, ncurses 6.0+20161224-1.

This also shows that fish functions have nothing to do with it as I am able to reproduce the issue by starting those programs directly.

I don't think it will show anything but try the attached patch. If signals are blocked or SIGHUP is being ignored when execve() is called it will be reported. There might be a bug in fish. Its code for launching external programs is in dire need of being refactored. But I have to be honest and say that I think it is more likely to be due to something wrong with your system.

Cd to the top of the fish-shell source tree and apply with patch -p1 < patch.txt.
patch.txt

Your patch does not compile:

src/exec.cpp: In function ‘void check_blocked_signals()’:
src/exec.cpp:160:44: error: invalid cast from type ‘sigset_t {aka __sigset_t}’ to type ‘long int’
         format_long_safe(sigset_buf, (long)oset);
                                            ^~~~

On my system (Arch Linux), sigset_t is a structure:

typedef struct
  {
    unsigned long int __val[_SIGSET_NWORDS];
  } __sigset_t;

Yeah, it's super annoying that there is no standard way to extract groups of signals from a sigset_t, as this is an implementation detail. In the case of GNU, replace the (long)oset with oset.__val[0] on line 160.

OK, this compiles, but the resulting fish will hang after printing one or two occurrences of

signals are blocked when doing execve(): 0

(I waited 5s)

Okay, you're not getting a stack dump because debug_shared() uses stdio functions which can deadlock on the child side of a fork. I'm perplexed why it's reporting zero for the first word of the blocked signals. However, since such things are implementation defined it's possible that on GNU libc the bits we're interested in are in the high order word of the sigset_t struct. The interesting thing is that the memcmp() is not returning zero which implies that signals are blocked when they shouldn't be.

In another window attach the debugger to the fish process using this as a template: gdb -p $pid /path/to/fish. Where $pid is the child fish process. Then type bt to get a backtrace. Also grep Sig /proc/$pid/status.

Backtrace before running anything

#0  0x00007f3e332e90b3 in select () from /usr/lib/libc.so.6
#1  0x0000000000492cd1 in input_common_readch(int) ()
#2  0x0000000000488247 in input_readch(bool) ()
#3  0x00000000004bd71a in reader_readline(int) ()
#4  0x00000000004c1aac in reader_read(int, io_chain_t const&) ()
#5  0x0000000000428d99 in main ()

Run fzf (which hangs as described above) then attach gdb and backtrace:

#0  0x00007f510e0eb0b3 in select () from /usr/lib/libc.so.6
#1  0x00000000004b444d in job_continue(job_t*, bool) ()
#2  0x000000000046c447 in exec_job(parser_t&, job_t*) ()
#3  0x00000000004a1970 in parse_execution_context_t::run_1_job(parse_node_t const&, block_t const*) ()
#4  0x00000000004a1d86 in parse_execution_context_t::run_job_list(parse_node_t const&, block_t const*) ()
#5  0x00000000004a1e1a in parse_execution_context_t::run_begin_statement(parse_node_t const&, parse_node_t const&) ()
#6  0x00000000004a2c6e in parse_execution_context_t::run_block_statement(parse_node_t const&) ()
#7  0x00000000004a2f8b in parse_execution_context_t::eval_node_at_offset(unsigned int, block_t const*, io_chain_t const&) ()
#8  0x00000000004ae2a6 in parser_t::eval_block_node(unsigned int, io_chain_t const&, block_type_t) ()
#9  0x000000000046a1d2 in ?? ()
#10 0x000000000046bdab in exec_job(parser_t&, job_t*) ()
#11 0x00000000004a1bfc in parse_execution_context_t::run_1_job(parse_node_t const&, block_t const*) ()
#12 0x00000000004a1d86 in parse_execution_context_t::run_job_list(parse_node_t const&, block_t const*) ()
#13 0x00000000004a3041 in parse_execution_context_t::eval_node_at_offset(unsigned int, block_t const*, io_chain_t const&) ()
#14 0x00000000004ae2a6 in parser_t::eval_block_node(unsigned int, io_chain_t const&, block_type_t) ()
#15 0x00000000004ae479 in parser_t::eval_acquiring_tree(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&, io_chain_t const&, block_type_t, moved_ref<parse_node_tree_t>) ()
#16 0x00000000004b8bb0 in ?? ()
#17 0x00000000004c1d0a in reader_read(int, io_chain_t const&) ()
#18 0x0000000000430742 in ?? ()
#19 0x000000000042ebcc in builtin_run(parser_t&, wchar_t const* const*, io_streams_t&) ()
#20 0x000000000046b39c in exec_job(parser_t&, job_t*) ()
#21 0x00000000004a1bfc in parse_execution_context_t::run_1_job(parse_node_t const&, block_t const*) ()
#22 0x00000000004a1d86 in parse_execution_context_t::run_job_list(parse_node_t const&, block_t const*) ()
#23 0x00000000004a3041 in parse_execution_context_t::eval_node_at_offset(unsigned int, block_t const*, io_chain_t const&) ()
#24 0x00000000004ae2a6 in parser_t::eval_block_node(unsigned int, io_chain_t const&, block_type_t) ()
#25 0x00000000004ae479 in parser_t::eval_acquiring_tree(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&, io_chain_t const&, block_type_t, moved_ref<parse_node_tree_t>) ()
#26 0x00000000004af674 in parser_t::eval(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&, io_chain_t const&, b---Type <return> to continue, or q <return> to quit---
lock_type_t) ()
#27 0x000000000046a48f in ?? ()
#28 0x000000000046c3b3 in exec_job(parser_t&, job_t*) ()
#29 0x00000000004a1bfc in parse_execution_context_t::run_1_job(parse_node_t const&, block_t const*) ()
#30 0x00000000004a1d86 in parse_execution_context_t::run_job_list(parse_node_t const&, block_t const*) ()
#31 0x00000000004a3041 in parse_execution_context_t::eval_node_at_offset(unsigned int, block_t const*, io_chain_t const&) ()
#32 0x00000000004ae2a6 in parser_t::eval_block_node(unsigned int, io_chain_t const&, block_type_t) ()
#33 0x00000000004ae479 in parser_t::eval_acquiring_tree(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&, io_chain_t const&, block_type_t, moved_ref<parse_node_tree_t>) ()
#34 0x00000000004af674 in parser_t::eval(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&, io_chain_t const&, block_type_t) ()
#35 0x000000000046a48f in ?? ()
#36 0x000000000046c3b3 in exec_job(parser_t&, job_t*) ()
#37 0x00000000004a1bfc in parse_execution_context_t::run_1_job(parse_node_t const&, block_t const*) ()
#38 0x00000000004a1d86 in parse_execution_context_t::run_job_list(parse_node_t const&, block_t const*) ()
#39 0x00000000004a3041 in parse_execution_context_t::eval_node_at_offset(unsigned int, block_t const*, io_chain_t const&) ()
#40 0x00000000004ae326 in parser_t::eval_block_node(unsigned int, io_chain_t const&, block_type_t) ()
#41 0x00000000004ae479 in parser_t::eval_acquiring_tree(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&, io_chain_t const&, block_type_t, moved_ref<parse_node_tree_t>) ()
#42 0x00000000004af674 in parser_t::eval(std::__cxx11::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&, io_chain_t const&, block_type_t) ()
#43 0x0000000000487fc2 in ?? ()
#44 0x00000000004883ec in input_readch(bool) ()
#45 0x00000000004bd71a in reader_readline(int) ()
#46 0x00000000004c1aac in reader_read(int, io_chain_t const&) ()
#47 0x0000000000428d99 in main ()

Before running anything:

> grep Sig /proc/$pid/status
SigQ:   0/15135
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000380004
SigCgt: 0000000188015013

After running fzf:

> grep Sig /proc/$pid/status
SigQ:   0/15135
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000006
SigCgt: 0000000188395011

Thanks, @Ambrevar. I think we're getting close to figuring out what is wrong. Notice that the SigIgn mask you captured "before running anything" includes SIGQUIT (the low order 0x4). That is normal for fish as it explicitly ignores that signal at startup. What is not normal is the inclusion of SIGINT "after running fzf" (bit one in the low order 0x6). Fish only ignores SIGINT if it isn't interactive. So that is odd but can be ignored since this problem doesn't involve SIGINT.

The gdb backtrace suggests to me that you have defined a function to handle SIGHUP. What does trap -p report?

I don't think I have anything that handles SIGHUP. Nothing is returned by trap -p.

For what it's worth, dmesg sometimes reports a segfault from fish. It seems to be corresponding to when the CPU hogs occur.

Various segfaults:

[ 1363.700961] fish[5480]: segfault at 7400000061 ip 0000007400000061 sp 00007fff5afd7f58 error 14
[ 1860.077668] fish[13510]: segfault at 0 ip 00007f90b904fd2e sp 00007ffd254c27b0 error 4 in libc-2.24.so[7f90b8fcd000+195000]
[ 1942.027020] fish[14114]: segfault at 7200000065 ip 00007f783a68af01 sp 00007ffde0a40cc0 error 6 in libc-2.24.so[7f783a61d000+195000]

That there are segfaults (i.e., SIGSEGV errors) being reported is almost certainly relevant to this issue. Fish should never segfault. What does ulimit -c report for you? It is probably "0", which means you won't get a core dump. In which case run ulimit -c unlimited before inducing the problem. That should result in a core dump file. The location and name varies according to the OS. It might just be core in the cwd.

If you can obtain a core dump while also inducing the fzf spinning problem I think that is our best chance of making forward progress on this issue. I say that because, in part, the earlier grep Sig /proc/$pid/status output you provided is not useful: "before running anything" the $pid would be the parent process, whereas I was interested in how signals were handled in the child process which, by definition, would not have the same PID.

I greatly appreciate your assistance, @Ambrevar. But this is the type of problem that will only be resolved if a core developer can reproduce it. So anything you can tell us about your environment that might allow one of us to reproduce the problem is valuable.

Just got a hunch: I am running Awesome and closing my windows with the default binding Mod4+Shift+c (Mod4 being the Windows key in my case). If I run ncdu from xterm and close the window, the terminal remains open at first and prints:

There are still jobs active (use the jobs command to see them).

The output gets garbled. Some key presses later:

“ncdu” has stopped

> jobs
Job     Group   CPU     State   Command
1       7932    0%      stopped ncdu

Seems like you were right about the job thing.
Also note that I cannot reproduce with xterm. (And it does not segfault.)

Stack trace from the core dump (with URxvt):

#0  0x00007fcd357f0d2e __strstr_sse2 (libc.so.6)
#1  0x00000000004bc090 _Z18reader_write_titleRKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEEb (fish)
#2  0x00000000004bc44a n/a (fish)
#3  0x00000000004bd56d _Z15reader_readlinei (fish)
#4  0x00000000004c19cc _Z11reader_readiRK10io_chain_t (fish)
#5  0x0000000000428e49 main (fish)
#6  0x00007fcd3578e291 __libc_start_main (libc.so.6)
#7  0x0000000000428f3a _start (fish)

Misuse of strstr?

Another finding: I can reproduce in a TTY!

Misuse of strstr?

Yes, if the tty has been closed (i.e., is no longer valid) then we'll be passing strstr() a NULL pointer. And I think I now know why no one else can reproduce this. Looking at your original comment I see you're using urxvt, which googling suggests sets TERM=rxvt-unicode. That TERM name isn't recognized, which causes us to enter the block containing the strstr() that is blowing up. Very few people use a TERM value that isn't recognized, so for most people that strstr() is never executed. A fix should be ready in a few minutes.

@Ambrevar, please try PR #3717.

  • Your fix removed the segfault that was always happening when the child process would be spinning. Thanks!

  • The bug is still present. Bummer! As you said, progress is impeded by core developers not being able to reproduce. urxvt + fish + ncurses seem to be the culprits here. Have you tried with URxvt? ncdu is a good test case among ncurses applications since it requires zero setup and triggers the issue very often.

  • Turns out there is at least one more segfault, which only happens every few child-spinning episodes.

Stack trace of thread 21096:
#0  0x00007f38997a9f01 _IO_wfile_overflow (libc.so.6)
#1  0x00007f38997a5d1a fputwc (libc.so.6)
#2  0x00000000004bdc7c _Z15reader_readlinei (fish)
#3  0x00000000004c1a0c _Z11reader_readiRK10io_chain_t (fish)
#4  0x0000000000428e39 main (fish)
#5  0x00007f389975c291 __libc_start_main (libc.so.6)
#6  0x0000000000428f2a _start (fish)

EDIT: Both URxvt and TTY still generate the issue.
EDIT2: One more segfault cause:

Stack trace of thread 18053:
#0  0x00007f48d9b36efa __memcpy_sse2_unaligned_erms (libc.so.6)
#1  0x00007f48d9b213bb _IO_wfile_xsputn (libc.so.6)
#2  0x00007f48d9b1d3ee fputws (libc.so.6)
#3  0x00000000004b698c n/a (fish)
#4  0x00000000004c1c05 _Z11reader_readiRK10io_chain_t (fish)
#5  0x0000000000428e49 main (fish)
#6  0x00007f48d9ad3291 __libc_start_main (libc.so.6)
#7  0x0000000000428f3a _start (fish)

EDIT3: I inspected the 300 coredumps that systemd had carefully stored unbeknownst to me :p. I found other stack traces from December that might have been fixed in the meantime. The coredumps obviously don't tell which version of fish produced them, so only closer inspection will tell whether they're relevant or not.

Stack trace of thread 12147:
#0  0x00007fda2ef4aeb7 __gconv_transform_internal_utf8 (libc.so.6)
#1  0x00007fda2ef9465c do_out (libc.so.6)
#2  0x00007fda2ef93cfc _IO_wdo_write (libc.so.6)
#3  0x00007fda2ef94327 _IO_wfile_xsputn (libc.so.6)
#4  0x00007fda2ef7ad94 buffered_vfprintf (libc.so.6)
#5  0x00007fda2ef7828e vfwprintf (libc.so.6)
#6  0x00007fda2ef90e77 fwprintf (libc.so.6)
#7  0x000000000044ec66 _Z15show_stackframewii (fish)
#8  0x000000000046cbc3 _Z8exec_jobR8parser_tP5job_t (fish)
#9  0x00000000004a11cc _ZN25parse_execution_context_t9run_1_jobERK12parse_node_tPK7block_t (fish)
#10 0x00000000004a1356 _ZN25parse_execution_context_t12run_job_listERK12parse_node_tPK7block_t (fish)
#11 0x00000000004a2641 _ZN25parse_execution_context_t19eval_node_at_offsetEjPK7block_tRK10io_chain_t (fish)
#12 0x00000000004ad8c6 _ZN8parser_t15eval_block_nodeEjRK10io_chain_t12block_type_t (fish)
#13 0x00000000004ada99 _ZN8parser_t19eval_acquiring_treeERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t9mov
#14 0x00000000004aec94 _ZN8parser_t4evalERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t (fish)
#15 0x000000000046a1cf n/a (fish)
#16 0x000000000046c0f3 _Z8exec_jobR8parser_tP5job_t (fish)
#17 0x00000000004a11cc _ZN25parse_execution_context_t9run_1_jobERK12parse_node_tPK7block_t (fish)
#18 0x00000000004a1356 _ZN25parse_execution_context_t12run_job_listERK12parse_node_tPK7block_t (fish)
#19 0x00000000004a2641 _ZN25parse_execution_context_t19eval_node_at_offsetEjPK7block_tRK10io_chain_t (fish)
#20 0x00000000004ad946 _ZN8parser_t15eval_block_nodeEjRK10io_chain_t12block_type_t (fish)
#21 0x00000000004ada99 _ZN8parser_t19eval_acquiring_treeERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t9mov
#22 0x00000000004aec94 _ZN8parser_t4evalERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t (fish)
#23 0x00000000004bc83e _Z18reader_run_commandR8parser_tRKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEE (fish)
#24 0x00000000004c1089 _Z11reader_readiRK10io_chain_t (fish)
#25 0x00000000004291c9 main (fish)
#26 0x00007fda2ef46291 __libc_start_main (libc.so.6)
#27 0x00000000004292ba _start (fish)
Stack trace of thread 13507:
#0  0x00007fa23a5da343 __libc_fork (libc.so.6)
#1  0x00000000004b1e25 _Z12execute_forkb (fish)
#2  0x000000000046b72a _Z8exec_jobR8parser_tP5job_t (fish)
#3  0x00000000004a1484 _ZN25parse_execution_context_t9run_1_jobERK12parse_node_tPK7block_t (fish)
#4  0x00000000004a1686 _ZN25parse_execution_context_t12run_job_listERK12parse_node_tPK7block_t (fish)
#5  0x00000000004a2911 _ZN25parse_execution_context_t19eval_node_at_offsetEjPK7block_tRK10io_chain_t (fish)
#6  0x00000000004ada16 _ZN8parser_t15eval_block_nodeEjRK10io_chain_t12block_type_t (fish)
#7  0x00000000004adbe9 _ZN8parser_t19eval_acquiring_treeERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t9mov
#8  0x00000000004aede4 _ZN8parser_t4evalERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t (fish)
#9  0x000000000046a1de n/a (fish)
#10 0x000000000046c283 _Z8exec_jobR8parser_tP5job_t (fish)
#11 0x00000000004a14f4 _ZN25parse_execution_context_t9run_1_jobERK12parse_node_tPK7block_t (fish)
#12 0x00000000004a1686 _ZN25parse_execution_context_t12run_job_listERK12parse_node_tPK7block_t (fish)
#13 0x00000000004a2911 _ZN25parse_execution_context_t19eval_node_at_offsetEjPK7block_tRK10io_chain_t (fish)
#14 0x00000000004ada96 _ZN8parser_t15eval_block_nodeEjRK10io_chain_t12block_type_t (fish)
#15 0x00000000004adbe9 _ZN8parser_t19eval_acquiring_treeERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t9mov
#16 0x00000000004aede4 _ZN8parser_t4evalERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t (fish)
#17 0x0000000000487dd2 n/a (fish)
#18 0x00000000004881fc _Z12input_readchb (fish)
#19 0x00000000004bcbaa _Z15reader_readlinei (fish)
#20 0x00000000004c0f8c _Z11reader_readiRK10io_chain_t (fish)
#21 0x000000000042a2d9 main (fish)
#22 0x00007fa23a542291 __libc_start_main (libc.so.6)
#23 0x000000000042a3ca _start (fish)






Stack trace of thread 1607:
#0  0x0000006f00000072 n/a (n/a)

The fault in _IO_wfile_overflow() has to be a GNU libc bug. The fputwc(L'\n', stdout) call that fish makes which results in the failure is obviously correct.

Same for the fault in __memcpy_sse2_unaligned_erms(). That is reached via a fputws() call, and the few such calls in that fish module are, again, obviously correct. A known glibc bug is very likely the source of these faults. Commit 396bf123 was merged Dec 14 to work around that bug. It's possible there are other code paths that can trigger the glibc bug and need a similar workaround.

And I got quite a few empty traces... Any clue why?

No.

I installed rxvt-unicode and ncdu. I still can't reproduce this problem using those programs on Ubuntu 16.04. Which isn't too surprising since the evidence is that this is strongly dependent on a particular OS and GNU libc combination.

@Ambrevar, Please try PR #3724. Frankly, if that doesn't fix it for you I'm going to give up unless you can provide more information about the nature of the problem.

The stack that shows the last frame being __libc_fork also strongly suggests memory corruption as that should be a "can't happen" situation. Possibly due to the previously mentioned glibc bug involving I/O to a broken fd open on a tty. "Broken" in the sense that I/O to the fd results in EIO errors.

No it does not fix it :(
I'm running out of ideas too.
Well, thank you very much for trying so hard.

Hmm, one more thing, maybe: I exec fish from my bashrc (bash remains my default shell because of the .profile and so on), if that would change anything.

Also when I use xterm and start any subprocess (not necessarily ncurses processes, e.g. sh) I get those (annoying) messages when I kill the terminal:

> sh
There are still jobs active (use the jobs command to see them).
A second attempt to exit will terminate them.
test: Missing argument at index 2
> exit
<E> fish: More than one job in foreground: job 1: “test (id -u) -eq 0” job 2: “id -u”
<E> fish: Errors detected, shutting down. Break on sanity_lose() to debug.
<E> fish: More than one job in foreground: job 1: “id -u” job 2: “sh”
<E> fish: Errors detected, shutting down. Break on sanity_lose() to debug.

Since it's not obvious from this output, here is what happens:

  • I start sh.
  • I kill the terminal.
  • The first 3 error lines are output.
  • I press any key.
  • The prompt appears ('>'), followed by exit (I did not enter that myself), followed by the 4 error lines.

Okay, I can reliably (not every time but every 3rd time or so) reproduce a variety of crashes.

The last one looks like this:

Stack trace:

 Stack trace of thread 11902:
 #0  0x00007f85546fdefa __memcpy_sse2_unaligned_erms (libc.so.6)
 #1  0x00007f85546e83bb _IO_wfile_xsputn (libc.so.6)
 #2  0x00007f85546e43ee fputws (libc.so.6)
 #3  0x00000000004b664c n/a (fish)
 #4  0x00000000004c1815 _Z11reader_readiRK10io_chain_t (fish)
 #5  0x0000000000428c82 main (fish)
 #6  0x00007f855469a291 __libc_start_main (libc.so.6)
 #7  0x0000000000428d1a _start (fish)

and strace:

--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=11901, si_uid=1000} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
getpgrp()                               = 11902
ioctl(0, TIOCSPGRP, [11902])            = 0
ioctl(0, TCGETS, {B38400 opost isig -icanon -echo ...}) = 0
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
--- SIGHUP {si_signo=SIGHUP, si_code=SI_KERNEL} ---
--- SIGCONT {si_signo=SIGCONT, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = 35025600
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo ...}) = -1 EIO (Input/output error)
fstat(2, {st_mode=S_IFCHR|0666, st_rdev=makedev(136, 3), ...}) = 0
write(2, "<", 1)                        = -1 EIO (Input/output error)
open("/usr/share/locale/de/LC_MESSAGES/libc.mo", O_RDONLY) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=155302, ...}) = 0
mmap(NULL, 155302, PROT_READ, MAP_PRIVATE, 6, 0) = 0x7f8555a60000
close(6)                                = 0
write(2, "<tcsetattr: Eing", 16)        = -1 EIO (Input/output error)
open("/proc/11943/stat", O_RDONLY|O_CLOEXEC) = 6
fcntl(6, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fstat(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(6, "11943 (ncdu) R 11902 11943 11902"..., 1024) = 312
close(6)                                = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x6700000150} ---
+++ killed by SIGSEGV (core dumped) +++

and the ncdu survives and goes nuts.

That's how it looks in urxvt. In konsole, it looks a bit different. There's no crash as such (EDIT: I previously wrote "konsole kills fish on its own"; I cannot substantiate that). It writes "Process 27600 did not die with SIGHUP", where 27600 is fish's PID. See the attached file for the strace output, since it's a bit long. One weird thing I've noticed: it seems we try to write the exit prompt after the terminal has hung up, which doesn't seem helpful. Either way, there don't seem to be any harmful effects here.

@faho, the strace output is fantastic. What it tells me is that we're writing an error message as a consequence of the ioctl() failing but before we've detected that the tty is now invalid and do the redirection to /dev/null to avoid the glibc bug. What OS are you using? And how are you inducing the failure?

What OS are you using?

Archlinux, like @Ambrevar.

And how are you inducing the failure?

Start urxvt, which starts the user's shell. In my case that's fish, so the bashrc "exec fish" isn't needed.

Then start ncdu in fish. It'll start "Scanning" the filesystem, and while it is doing so (this seems to be where a bit of timing is needed) kill urxvt (I actually close it via my WM, but I'm assuming that sends a SIGTERM anyway).

Note that there's a bunch of different call stacks I've gotten, but there's always an "fputw*" in there.

I started to install Archlinux a few months ago, but it isn't explicitly supported by VMware so I gave up: it was going to take more effort than it was worth to me. I can't reproduce this using macOS 10.12 or Ubuntu 16.04. I should probably try Ubuntu 14.04, which I still have available.

I'll scan our code for all the places we might write an error after a failed ioctl or similar call and ensure we do the stdin/stdout redirect if errno is EIO.

I can't reproduce this using macOS 10.12 or Ubuntu 16.04. I should probably try on Ubuntu 14.04 which I still have available.

You'll probably want to go the opposite direction - Ubuntu 16.10. Arch pretty much always has newer software than Ubuntu, so with 14.04, you're moving further away version-wise.

Anyway, it seems fish, glibc and urxvt are involved here, so the versions of those are probably relevant. Fish is 2.4.0-335-gdcf3d03e, i.e. current master without #3724. glibc is 2.24 (really commit fdfc9260 plus one patch), urxvt is 9.22 without patches.

I'll scan our code for all the places we might write an error after a failed ioctl or similar call and ensure we do the stdin/stdout redirect if errno is EIO.

In this case, isn't the writing entirely superfluous since the terminal is already gone?

In this case, isn't the writing entirely superfluous since the terminal is already gone?

Yes, but it's simpler and much cleaner to just do the I/O we would normally do while executing whatever code needs to run while we're exiting. We really don't want to introduce a global bool that we predicate all those writes on. Frankly, if this wasn't a glibc bug that lots of fish users will be impacted by for years to come I wouldn't bother working around it. I would just tell affected users to complain to the people responsible for the distro they're running. Sadly I don't think we can take the easy way out with respect to this glibc bug.

@Ambrevar and @faho: Please try PR #3724 again. I've added a commit which I'm hopeful will properly work around the glibc bug.

Good news: #3724 fixes the ncurses issue! ranger, htop, ncdu: none of them triggers it anymore.

Bad news: the function issue that was mentioned in the OP is still there! So it was a double-issue after all.
ranger-cd and fzf-file-widget will still fail, and presumably any ncurses app started from a function.

I still have those messages in xterm: exit is still being written, errors still get printed out. Can you shed some light on that?

@krader1961: No joy. Apparently ENOTTY is also a concern here:

--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=12478, si_uid=1000} ---
--- SIGCONT {si_signo=SIGCONT, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], NULL, 8) = 0
getpgrp()                               = 12479
ioctl(0, TIOCSPGRP, [12479])            = -1 ENOTTY (Inappropriate ioctl for device)
fstat(2, {st_mode=S_IFCHR|0666, st_rdev=makedev(136, 6), ...}) = 0
write(2, "<", 1)                        = -1 EIO (Input/output error)
open("/usr/share/locale/de/LC_MESSAGES/libc.mo", O_RDONLY) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=155302, ...}) = 0
mmap(NULL, 155302, PROT_READ, MAP_PRIVATE, 6, 0) = 0x7f6d5b587000
close(6)                                = 0
write(2, "<tcsetpgrp: Unpa", 16)        = -1 EIO (Input/output error)
rt_sigprocmask(SIG_UNBLOCK, ~[RTMIN RT_1], NULL, 8) = 0
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig -icanon -echo ...}) = -1 EIO (Input/output error)
open("/dev/null", O_WRONLY)             = 6
ioctl(0, TCGETS, 0x7ffe7ea07a90)        = -1 EIO (Input/output error)
dup2(6, 0)                              = 0
ioctl(1, TCGETS, 0x7ffe7ea07a90)        = -1 EIO (Input/output error)
dup2(6, 1)                              = 1
ioctl(2, TCGETS, 0x7ffe7ea07a90)        = -1 EIO (Input/output error)
dup2(6, 2)                              = 2
close(6)                                = 0
write(2, "<tcsetpgrp: Unpa", 16)        = 16
write(2, "ssender IOCTL (I", 16)        = 16
write(2, "/O-Control) f\303\274r", 16)  = 16
write(2, " das Ger\303\244t\n<W> ", 16) = 16
write(2, "fish: Konnte Ter", 16)        = 16
write(2, "minalmodus f\303\274r ", 16)  = 16
write(2, "neue Shell nicht", 16)        = 16
write(2, " festlegen\n", 11)            = 11
write(2, "tcsetattr: Einga", 16)        = 16
write(2, "be-/Ausgabefehle", 16)        = 16
write(2, "r\n", 2)                      = 2
open("/proc/12724/stat", O_RDONLY|O_CLOEXEC) = 6
fcntl(6, F_GETFL)                       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
fstat(6, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(6, "12724 (ncdu) D 12479 12724 12479"..., 1024) = 309
close(6)                                = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x6100000160} ---
+++ killed by SIGSEGV (core dumped) +++

Thanks, @faho. That reminded me why I was testing for errno == ENOTTY everywhere in my initial fix for this issue: tcsetpgrp() reports that errno, whereas tcsetattr() reports EIO. I've pushed what I hope is the final change for the portion of this issue dealing with the tty becoming invalid.

Unless I find a simple fix for the problem with running fzf, etc., from a function I'm going to open a new issue since it's a best practice to have one distinct problem per issue.

@Ambrevar: It's possible the bug in my fix that I just pushed a change to correct will also fix the issue with running those programs from inside a function. Please test with the latest version of PR #3724.

I still have those messages in xterm: exit is still being written, errors still get printed out. Can you shed some light on that?

Not at this time. Those are errors reporting what amount to "can't happen" situations. You say you get them "when I kill the terminal", which I can't make sense of: if you killed the terminal, and those messages are written to the terminal, how do you manage to see them? It might help us debug this if you run exec fish -d5 2>/tmp/fish_debug.txt and then reproduce the issue. Also, since this is probably a problem distinct from working around the glibc stdio bug, please open a new issue with the comment you made a day ago.

I've pushed what I hope is the final change for the portion of this issue dealing with the tty becoming invalid.

@krader1961: Yes, that works, and it indeed seems to fix both ncdu and ranger-cd.

@faho: are you sure? This does not seem to be fixed for me, although the frequency seems dramatically reduced (maybe just luck). I get fewer segfaults when the CPU hogs happen. When the segfaults do happen, I get the same stack traces as mentioned before.

@krader1961 Sorry for the confusion with xterm, I meant when I _try_ to close the terminal, it does not, and prints those messages instead. The shell is left in an unusable state. When I try to close it a second time, then it works.
I thought the issue might be related. @faho mentioned that exit is being written after the terminal was closed.

are you sure? This does not seem to be fixed for me. Although the frequency seems somewhat dramatically reduced (maybe just luck). I get less segfaults when the CPU hogs happen. When the segfaults do happen, I get the same stack traces as mentioned before.

I don't get CPU hogs, and no segfaults. Neither with your original ranger-cd function (even augmented with about 1000 echo calls) nor with ncdu. What I get is fish and ranger killed by SIGHUP, as it should be.

I've now tested this about 30 times with either of the commands.

Sorry for the confusion with xterm, I meant when I try to close the terminal, it does not, and prints those messages instead. The shell is left in an unusable state. When I try to close it a second time, then it works.

We print that message when there are jobs left. Do you launch anything in the background? What does jobs say?

I thought the issue might be related.

It might be a case we're still missing.

Weird you cannot reproduce... You are using Arch's stock glibc, right?

xterm: that's the thing, there is no job in the background!

> jobs
jobs: There are no jobs

It does this when some subprocess is running (like sh) and I try to kill the terminal.

xterm: that's the thing, there is no job in the background!
It does this when some subprocess is running (like sh) and I try to kill the terminal.

Well, in that case there is a "job" running: That shell!

Of course what should happen is that we've received SIGHUP, not an interactive request to exit (via the exit builtin), so we should just pass the signal on and die.

This is probably caused by #3658, though I don't quite get why. This should be setting "exit_forced", in which case it shouldn't prompt.

The only way it's not is if it thinks the signal is caught by a function.

Of course what should happen is that we've received SIGHUP, not an interactive request to exit (via the exit builtin), so we should just pass the signal on and die.

You know, thinking more about it I'm not 100% sure about that.

The weird thing is that with an inner fish, the outer one is usable. With an inner bash, any character pressed will kill it. Also, this only happens in xterm - konsole says something about "process $PID did not die with SIGHUP", urxvt just closes. A usable fish might be okay in this situation.

Anyway, since I'm not observing a _crash_ here, I suggest we track this in a different issue.

I'm going to retry getting ncdu/ranger-cd to go nuts.

(Edit: clarification)

Note that ncdu and ranger by themselves run fine: they only start spinning when run from a function. I can reproduce this easily with

function ncdu-wrapper
    ncdu
    echo
    echo
    echo
    # ... I suppose you should add more echo if your machine is fast.
end

I cannot reproduce without the echos (or maybe the odds are just too small). A question I don't think we have tackled so far: why the echos?
To me it sounds like the function keeps executing despite the SIGHUP, but if it terminates fast enough then everything shuts down properly for some reason.

Or maybe not. I tried running this:

function ncdu-wrapper
    rm ~/foo.log ~/bar.log ~/baz.log
    ncdu
    echo foo >~/foo.log
    echo bar >~/bar.log
    echo baz >~/baz.log
    # ... I suppose you should add more echo if your machine is fast.
end

Upon spinning, none of the files got created, so I assume those lines did not execute.

Okay, I managed to get ncdu spinning again. Unfortunately I forgot to attach strace.

The dump shows:

#0  0x00007fe15efc7eb7 __gconv_transform_internal_utf8 (libc.so.6)
#1  0x00007fe15f01165c do_out (libc.so.6)
#2  0x00007fe15f010cfc _IO_wdo_write (libc.so.6)
#3  0x00007fe15f011327 _IO_wfile_xsputn (libc.so.6)
#4  0x00007fe15eff7d94 buffered_vfprintf (libc.so.6)
#5  0x00007fe15eff528e vfwprintf (libc.so.6)
#6  0x00007fe15f00de77 fwprintf (libc.so.6)
#7  0x000000000044eb96 _Z15show_stackframewii (fish)
#8  0x000000000046ceba _Z8exec_jobR8parser_tP5job_t (fish)
#9  0x00000000004a1bf4 _ZN25parse_execution_context_t9run_1_jobERK12parse_node_tPK7block_t (fish)
#10 0x00000000004a1d86 _ZN25parse_execution_context_t12run_job_listERK12parse_node_tPK7block_t (fish)
#11 0x00000000004a3028 _ZN25parse_execution_context_t19eval_node_at_offsetEjPK7block_tRK10io_chain_t (fish)
#12 0x00000000004ae2a6 _ZN8parser_t15eval_block_nodeEjRK10io_chain_t12block_type_t (fish)
#13 0x00000000004ae479 _ZN8parser_t19eval_acquiring_treeERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t9moved_refI17parse_node_tree_tE (fish)
#14 0x00000000004af674 _ZN8parser_t4evalERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t (fish)
#15 0x000000000046a54f n/a (fish)
#16 0x000000000046c483 _Z8exec_jobR8parser_tP5job_t (fish)
#17 0x00000000004a1bf4 _ZN25parse_execution_context_t9run_1_jobERK12parse_node_tPK7block_t (fish)
#18 0x00000000004a1d86 _ZN25parse_execution_context_t12run_job_listERK12parse_node_tPK7block_t (fish)
#19 0x00000000004a3028 _ZN25parse_execution_context_t19eval_node_at_offsetEjPK7block_tRK10io_chain_t (fish)
#20 0x00000000004ae326 _ZN8parser_t15eval_block_nodeEjRK10io_chain_t12block_type_t (fish)
#21 0x00000000004ae479 _ZN8parser_t19eval_acquiring_treeERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t9moved_refI17parse_node_tree_tE (fish)
#22 0x00000000004af674 _ZN8parser_t4evalERKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEERK10io_chain_t12block_type_t (fish)
#23 0x00000000004ba0ae _Z18reader_run_commandR8parser_tRKNSt7__cxx1112basic_stringIwSt11char_traitsIwESaIwEEE (fish)
#24 0x00000000004c1b5a _Z11reader_readiRK10io_chain_t (fish)
#25 0x0000000000428f25 main (fish)
#26 0x00007fe15efc3291 __libc_start_main (libc.so.6)
#27 0x000000000042902a _start (fish)

That "show_stackframe" is only run when a builtin is run without redirection and exits before its output is done.

So that would explain why the "echo" is necessary, and also why the redirection fixes it.

Edit: For this particular crash, an unredirected echo is necessary. No idea why it's being run though.

OK, now you are getting all the stack traces I have also had. (See my previous comments.) They were different conditions and different bugs indeed.

The show_stackframe() invocation in src/exec.cpp shown in @faho's backtrace was introduced by commit 640118e7 from ridiculousfish way back on 2013-06-16. If you do a git blame src/exec.cpp now it will suggest I introduced that code. However, I simply modified the show_stackframe() call to make the message level explicit. It's a known problem that show_stackframe() is not safe to call from the child side of fork() and can deadlock or die.

It is reasonably clear that working around the glibc bug writing to an invalid stdout descriptor has been fixed. So I'm going to open a new issue to track the problem with running commands like fzf and ncdu from a function then close this issue. I'd be thrilled if @Ambrevar or @faho beat me to it 😄

I'm closing this because there are at least two bugs and we've fixed the main one in this issue. That is, we worked around the GNU libc bug that makes writing to a fd open on a tty in the EIO state cause unpredictable behavior. I've opened issue #3737 to track resolving why another failure mode seems to occur when running fzf and ncdu from a function when the tty becomes invalid.

FWIW, it looks like the last stack trace shown by @faho is another manifestation of the glibc bug. I just merged commit d905ed3 to work around that particular failure. I was wrong in asserting in my previous comment that we were running on the child side of the fork() in that case.

Having said that I don't see how the glibc bug could cause the behavior documented in @Ambrevar's comment four days ago. So I still intend to pursue that in issue #3737.
