Occasionally, k9s on Linux Fedora 4.20.16-200.fc29.x86_64 hangs. I am unable to input any data into k9s, nor even use Ctrl-C. At this point I have to use kill -9 from another terminal, at which point I can then start k9s again. The terminal itself is not frozen because once the kill -9 is done, the terminal shows k9s as killed, and k9s can be run again in the same terminal.
I don't know how to reproduce this consistently. If there are some debugging steps I can take the next time this freeze happens, let me know. Perhaps a gdb thread dump?
Versions (please complete the following information):
I'm experiencing the same issue. It usually happens when I try to check the logs for a pod.
OS: Manjaro Linux
K9s Rev: 0.5.1
K8s Rev: v1.11.8-eks-7c34c0
@rocketraman Thank you for the report! It would be useful if you could note what kind of interactions you were performing when this happens. FYI, when it does happen your terminal session is not toast: in most cases typing reset should correct your session. I'll add some instrumentation so we can track this down.
Same as @ncsibra it does seem to happen most often when having viewed logs. But as I recall, it does not always freeze while the actual view log operation is in progress.
Extra info, it can hang multiple times on the same resource, after a kill and restart, but, if I first open the container view with enter, then check the logs of the container, not the pod itself, it works every time.
The last pod where I faced this issue had only one container.
I was able to hang 0.5.1 by doing the opposite (I think) of @ncsibra's comment -- I hit enter on a pod to see its individual containers, and then opening logs via l on the first (and only) container froze k9s immediately. However, when I tried it again on exactly the same pod and container after restarting k9s, it worked, so I still cannot reproduce this consistently.
Ok, I'm actually able to reproduce this consistently by going to a pod or a container, doesn't matter which, and hitting l, esc, l, esc, a bunch of times. Eventually, usually after 2-4 iterations, it hangs.
@rocketraman @ncsibra Thank you both for the additional details on this! I did manage to repro, though it took more than 2-4 iterations, and was able to get to a screen freeze via l+
I think I have a fix in 0.5.2. Please help me verify. Closing for now. Tx!
@derailed I built tag 0.5.2 and was still able to make it hang. Took a bit longer but still happened. Same repro.
@derailed I can confirm, the issue still exists.
I cloned your repo and reproduced the hang with pprof enabled. It looks like you have a deadlock: the update queue in the tview lib is full, so refresh cannot finish because it is blocked on a channel send. It therefore keeps holding the update lock, and the switchPage method is unable to acquire it.
I don't know why your queue is full. Your code is rather complex, and I would need to spend much more time to familiarize myself with it.
Here's the stack dump and block profile, created with runtime.SetBlockProfileRate(1), I hope this helps.
pprof.tar.gz
@rocketraman @ncsibra Thanks for reporting back! Not keen on this repro scenario, as it stems from ab-using vs using K9s ;( In this case though, it revealed a much deeper issue, which took me on a wild ride down the rabbit hole. @ncsibra Thank you for your insight! You are correct, tview uses a buffered update channel, which the log view totally hammers. I've found several issues with the current code base and hopefully will provide a viable fix (this time for sure) under the more common use scenario ;)
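One general way to keep a producer from blocking on a full buffered channel is a non-blocking send. The sketch below only illustrates that idea and is not the fix that went into K9s; `tryQueue` and the `updates` channel are hypothetical names, and a real log view would also have to decide what to do with a rejected update (drop it, coalesce it, or retry later).

```go
package main

import "fmt"

// tryQueue attempts to queue fn without blocking. It reports whether the
// update was accepted; when the buffer is full it returns false immediately.
func tryQueue(updates chan<- func(), fn func()) bool {
	select {
	case updates <- fn:
		return true
	default:
		return false
	}
}

func main() {
	updates := make(chan func(), 2) // small buffer that nobody drains

	accepted, rejected := 0, 0
	for i := 0; i < 5; i++ {
		if tryQueue(updates, func() {}) {
			accepted++
		} else {
			rejected++
		}
	}
	fmt.Println("accepted:", accepted, "rejected:", rejected)
}
```

Because the send never blocks, a caller holding a lock always gets to release it, at the cost of possibly losing updates.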
Thanks for reporting back! Not keen on this repro scenario, as it stems from ab-using vs using K9s
Keep in mind, the repro is just a repro. After using the pre-release 0.5.2 for a bit, it did hang under normal use, and it seemed to happen more often than 0.5.1.
Glad to hear you've found some issues though!
Thanks @rocketraman! Hope this will do it... 0.6.0
Haven't noticed any issues with 0.6.0 so far! BTW, happy birthday! :-)
Ah! Thank you kindly @rocketraman!!
Boy am I glad to hear this. Awesome and thank you for reporting back.
@derailed Oh oh, spoke too soon... just experienced another hang. It was just a "use" scenario, not an "ab-use" one -- this time it happened when exiting the shell of a pod.
@rocketraman Rats! you were supposed to come bearing gifts not bugs... Great catch! Thank you for reporting this. I'll push a fix in the next drop 0.6.1. Hopefully this time for sure...
@derailed I had another hang in 0.6.1, which seems to be a slightly different case, so I'm happy to open a new issue... my trusty old switch died and dropped my networking while I was away on business. When I got back, I found my active k9s session was hung, again requiring a kill -9. As this happened while I was away, I don't know if the hang was caused by the network interruption, or by something else, but 0.6.1 definitely hung again!
@rocketraman Thank you for reporting this! No worries, might as well try to flush these out. K9s should be more resilient even if the cable is pulled. I'll take a look.
@rocketraman Fixed 0.6.7!
Super great tool, thanks!
I've been getting what appears to be this freeze with 0.20.3.
Anything I can do to help debug?
I've had 0.20.3 also freeze a few times, will upgrade to the latest version and report!
Great tool overall!
@jkleckner @nonsense Thank you both for your kindness! Right, I had resiliency issues in the last drop regarding locking. Please give 0.20.5 a rinse and let me know if we're happier... Tx!!
@derailed Thanks, it hasn't happened to me so far since upgrading.
@derailed I'm seeing hangs on 0.20.5 when doing logs for deployments with multiple pods.
I'm seeing hangs on 0.20.5 when doing logs for deployments with multiple pods.
@rocketraman Note that at least in 0.21.0 this doesn't require a kill -9 but bails out as described in #790.