K9s: Occasional hang requiring kill -9

Created on 22 Apr 2019 · 25 Comments · Source: derailed/k9s

Occasionally, k9s on Linux Fedora 4.20.16-200.fc29.x86_64 hangs. I am unable to input any data into k9s, nor even use Ctrl-C. At this point I have to use kill -9 from another terminal, at which point I can then start k9s again. The terminal itself is not frozen because once the kill -9 is done, the terminal shows k9s as killed, and k9s can be run again in the same terminal.

I don't know how to reproduce this consistently. If there are some debugging steps I can take the next time this freeze happens, let me know. Perhaps a gdb thread dump?

Versions:

OS: Fedora 26, kernel 4.20.16-200.fc29.x86_64, terminal is Konsole
K9s: 0.5.0 (currently running 0.5.1, will see if it happens again)
K8s: 1.12.4

bug

rocketraman

All 25 comments

I'm experiencing the same. It usually happens when I try to check the logs for a pod.

OS: Manjaro Linux
K9s Rev: 0.5.1 
K8s Rev: v1.11.8-eks-7c34c0

ncsibra on 23 Apr 2019

@rocketraman Thank you for the report! It would be useful if you could note what kind of interactions you were performing when this happens. FYI, when it does happen your terminal session is not toast: in most cases, typing reset should restore your session. I'll add some instrumentation so we can track this down.

derailed on 23 Apr 2019

Same as @ncsibra, it does seem to happen most often after viewing logs. But as I recall, it does not always freeze while the actual view-log operation is in progress.

rocketraman on 23 Apr 2019

Extra info: it can hang multiple times on the same resource, even after a kill and restart. But if I first open the container view with enter and then check the logs of the container, not the pod itself, it works every time.
The last pod where I faced this issue had only one container.

ncsibra on 25 Apr 2019

I was able to hang 0.5.1 by doing the opposite (I think) of @ncsibra's comment -- I hit enter on a pod to see its individual containers, and then opening logs via l on the first (and only) container froze k9s immediately. However, when I tried it again on exactly the same pod and container after restarting k9s, it worked, so I still cannot reproduce this consistently.

rocketraman on 26 Apr 2019

Ok, I'm actually able to reproduce this consistently by going to a pod or a container, doesn't matter which, and hitting l, esc, l, esc, a bunch of times. Eventually, usually after 2-4 iterations, it hangs.

rocketraman on 26 Apr 2019

@rocketraman @ncsibra Thank you both for the additional details on this! I did manage to repro, though it took more than 2-4 iterations, and was able to get to a screen freeze via l+.

I think I have a fix in 0.5.2. Please help me verify. Closing for now. Tx!

derailed on 27 Apr 2019

@derailed I built tag 0.5.2 and was still able to make it hang. Took a bit longer but still happened. Same repro.

rocketraman on 27 Apr 2019

@derailed I can confirm the issue still exists.
I cloned your repo and reproduced it with pprof enabled. It looks like you have a deadlock: the update queue in the tview lib is full, so refresh cannot finish because it is blocked on a chan send while still holding the update lock, and therefore the switchPage method cannot acquire that lock.
I don't know why the queue is full. Your code is rather complex, and I would need to spend much more time to familiarize myself with it.
Here's the stack dump and block profile, created with runtime.SetBlockProfileRate(1), I hope this helps.
pprof.tar.gz

ncsibra on 27 Apr 2019
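For readers who want to collect the same kind of evidence from their own Go program, here is a minimal sketch of capturing a goroutine stack dump and a block profile with the standard runtime/pprof package. It is not taken from k9s, and the thread does not say how pprof was wired in for the capture above; the helper names enableBlockProfiling, captureProfiles, and contendOnce are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// enableBlockProfiling asks the runtime to record every blocking event
// (rate 1), as mentioned in the comment above. Hypothetical helper name.
func enableBlockProfiling() {
	runtime.SetBlockProfileRate(1)
}

// captureProfiles returns a text dump of all goroutine stacks and a text
// rendering of the block profile collected so far.
func captureProfiles() (stacks, blocked string, err error) {
	var buf bytes.Buffer
	// debug=2 prints each goroutine's stack in the same form as a panic trace.
	if err = pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", "", err
	}
	stacks = buf.String()
	buf.Reset()
	// debug=1 writes the human-readable text form of the block profile.
	if err = pprof.Lookup("block").WriteTo(&buf, 1); err != nil {
		return "", "", err
	}
	return stacks, buf.String(), nil
}

// contendOnce creates a short, harmless lock wait so the block profile
// has something to report in this demo.
func contendOnce() {
	var mu sync.Mutex
	mu.Lock()
	done := make(chan struct{})
	go func() {
		mu.Lock() // waits until the outer goroutine unlocks
		mu.Unlock()
		close(done)
	}()
	time.Sleep(20 * time.Millisecond)
	mu.Unlock()
	<-done
}

func main() {
	enableBlockProfiling()
	contendOnce()
	stacks, blocked, err := captureProfiles()
	if err != nil {
		fmt.Println("capture failed:", err)
		return
	}
	fmt.Println(stacks)
	fmt.Println(blocked)
}
```

In a program that is already hung, the capture would have to be triggered from outside the stuck code path, for example from a signal handler or by exposing the standard net/http/pprof handlers; which of these fits depends on the application.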

@rocketraman @ncsibra Thanks for reporting back! Not keen on this repro scenario, as it stems on ab-using vs using K9s ;( In this case though, this revealed a much deeper issue, which took me on a wild ride down the rabbit hole. @ncsibra Thank you for your insight! You are correct, tview uses a buffered update channel, which the log view totally hammers. I've found several issues with the current code base and hopefully will provide a viable fix (this time for sure) under the more common use scenario ;)

derailed on 29 Apr 2019
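The deadlock shape described in the two comments above can be sketched in a few lines of Go. This is an illustration only, not k9s or tview code: the ui type, its methods, and the queue size of 4 are hypothetical, and the sketch assumes that nothing drains the queue while switchPage is waiting. The tryRefresh variant shows one common mitigation, a non-blocking send that drops the update when the queue is full; the thread does not say whether the 0.6.0 fix worked this way.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ui is a hypothetical stand-in for an app with a bounded update queue.
type ui struct {
	mu      sync.Mutex
	updates chan func()
}

// refresh queues a redraw while holding the lock. If the queue is full,
// the send blocks and the lock stays held.
func (u *ui) refresh() {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.updates <- func() {}
}

// tryRefresh is a non-blocking variant: it drops the update (returns
// false) instead of waiting when the queue is full.
func (u *ui) tryRefresh() bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	select {
	case u.updates <- func() {}:
		return true
	default:
		return false
	}
}

// switchPage needs the same lock that refresh holds.
func (u *ui) switchPage() {
	u.mu.Lock()
	defer u.mu.Unlock()
}

// switchPageStalls fills the queue, runs the given producer, and reports
// whether switchPage is still blocked 200ms later.
func switchPageStalls(produce func(*ui)) bool {
	u := &ui{updates: make(chan func(), 4)}
	for i := 0; i < cap(u.updates); i++ {
		u.updates <- func() {} // fill the queue; nothing drains it
	}
	go produce(u)
	time.Sleep(50 * time.Millisecond) // give the producer time to take the lock
	done := make(chan struct{})
	go func() {
		u.switchPage()
		close(done)
	}()
	select {
	case <-done:
		return false
	case <-time.After(200 * time.Millisecond):
		return true
	}
}

func demoBlocking() bool    { return switchPageStalls(func(u *ui) { u.refresh() }) }
func demoNonBlocking() bool { return switchPageStalls(func(u *ui) { u.tryRefresh() }) }

func main() {
	fmt.Println("blocking send stalls switchPage:", demoBlocking())
	fmt.Println("non-blocking send stalls switchPage:", demoNonBlocking())
}
```

Dropping updates is a trade-off: it keeps the lock from being held indefinitely, but a busy log view may then skip redraws, so batching updates before queueing them is another option worth considering.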

"Thanks for reporting back! Not keen on this repro scenario, as it stems on ab-using vs using K9s"

Keep in mind, the repro is just a repro. After using the pre-release 0.5.2 for a bit, it did hang under normal use, and it seemed to happen more often than 0.5.1.

Glad to hear you've found some issues though!

rocketraman on 29 Apr 2019

Thanks @rocketraman! Hope this will do it... 0.6.0

derailed on 29 Apr 2019

Haven't noticed any issues with 0.6.0 so far! BTW, happy birthday! :-)

rocketraman on 29 Apr 2019

Ah! Thank you kindly @rocketraman!!
Boy am I glad to hear this. Awesome and thank you for reporting back.

derailed on 29 Apr 2019

@derailed Oh oh, spoke too soon... just experienced another hang. It was just a "use" scenario, not an "ab-use" one -- this time it happened when exiting the shell of a pod.

rocketraman on 29 Apr 2019

@rocketraman Rats! You were supposed to come bearing gifts, not bugs... Great catch! Thank you for reporting this. I'll push a fix in the next drop, 0.6.1. Hopefully this time for sure...

derailed on 30 Apr 2019

@derailed I had another hang in 0.6.1, which seems to be a slightly different case, so I'm happy to open a new issue... my trusty old switch died and dropped my networking while I was away on business. When I got back, I found my active k9s session was hung, again requiring a kill -9. As this happened while I was away, I don't know if the hang was caused by the network interruption, or by something else, but 0.6.1 definitely hung again!

rocketraman on 3 May 2019

@rocketraman Thank you for reporting this! No worries, might as well try to flush these out. K9s should be more resilient even if the cable is pulled. I'll take a look.

derailed on 4 May 2019

@rocketraman Fixed in 0.6.7!

derailed on 25 May 2019

Super great tool, thanks!

I've been getting what appears to be this freeze with 0.20.3.

Anything I can do to help debug?

jkleckner on 5 Jun 2020

I've had 0.20.3 also freeze a few times, will upgrade to the latest version and report!

Great tool overall!

nonsense on 5 Jun 2020

@jkleckner @nonsense Thank you both for your kindness! Right, I had resiliency issues in the last drop regarding locking. Please give 0.20.5 a rinse and let me know if we're happier... Tx!!

derailed on 5 Jun 2020

@derailed Thanks, it hasn't happened to me so far since upgrading.

jkleckner on 9 Jun 2020

@derailed I'm seeing hangs on 0.20.5 when doing logs for deployments with multiple pods.

rocketraman on 19 Jun 2020