Describe the bug
The container restarts frequently when the API server is not responding, even though it should not have to.
Version of Helm and Kubernetes:
helm 2.14.1
kubernetes 1.12.7-1
Which chart:
latest stable/nfs-server-provisioner
What happened:
I have a cluster with latency and occasional failures of the API server. I have submitted a bug report to the cloud provider that manages the control plane.
My issue with nfs-server-provisioner is that during these API server failures I see logs like the following and the container restarts:
I0619 07:05:11.965197 1 leaderelection.go:231] failed to renew lease nfs/cluster.local-nfs-data-nfs-server-provisioner: failed to tryAcquireOrRenew context deadline exceeded
F0619 07:05:11.966172 1 controller.go:646] leaderelection lost
I have lots of problems with the PVCs based on this provisioner: binding timeouts, lost file writes, etc.
I am not certain, but I think they are related to these frequent restarts of the NFS server.
What you expected to happen:
The NFS server should tolerate API server failures.
How to reproduce it (as minimally and precisely as possible):
I do not manage the control plane of my cluster, but I suppose that manually stopping the API server should reproduce the issue.
Anything else we need to know:
I understand restarting the container in this case is probably the expected behavior, but I think it is wrong.
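For context on why the pod restarts: the "failed to renew lease" / "leaderelection lost" lines above are the standard leader election pattern (as in client-go's leaderelection package), where a missed lease renewal triggers the OnStoppedLeading callback and the process exits fatally. The Go sketch below illustrates that pattern with assumed lock names, namespace and timings; it is not the provisioner's actual wiring. Making the election more tolerant of API server latency mostly means raising LeaseDuration and RenewDeadline, at the cost of slower failover to another replica.

// Minimal sketch of the client-go leader election pattern (illustrative only;
// the nfs-provisioner's actual setup and lock type differ).
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease-based lock; "nfs-provisioner" and "nfs" are placeholder names.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "nfs-provisioner", Namespace: "nfs"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Longer durations make the election more tolerant of a slow API
		// server, at the cost of slower failover to another replica.
		LeaseDuration: 60 * time.Second,
		RenewDeadline: 45 * time.Second,
		RetryPeriod:   10 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the provisioning loop while holding the lease.
			},
			// When renewal fails (e.g. "failed to tryAcquireOrRenew context
			// deadline exceeded"), this callback fires and the usual pattern
			// is a fatal exit, which shows up as a container restart.
			OnStoppedLeading: func() {
				klog.Fatalf("leaderelection lost")
			},
		},
	})
}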
@albanm I'm seeing a very similar issue - which cloud provider are you seeing the latency with?
It is OVH. Their managed Kubernetes solution is quite recent.
I have published a temporary fork of nfs-provisioner on Docker Hub that deactivates the leader election system in the meantime. If you wish, you can test it with these options for the Helm chart:
image:
  repository: koumoul/nfs-provisioner
  tag: v1.0.0
Hey, I have seen the issue described above in multiple clusters, most recently on IBM and AWS. In both cases the pod restarts, and these are the last logs you can read:
I0810 18:47:10.541004 1 leaderelection.go:231] failed to renew lease default/worldsibu.com-nfs: failed to tryAcquireOrRenew context deadline exceeded
F0810 18:47:10.541039 1 controller.go:646] leaderelection lost
@albanm do you happen to have the changes performed on a git repo?
The image is here: https://hub.docker.com/r/koumoul/nfs-provisioner/tags
The fork is here: https://github.com/koumoul-dev/external-storage
There are 5 commits by me, but 4 of them are only about the build. The fix commit is this one: https://github.com/koumoul-dev/external-storage/commit/aa1869b605c6944f271df351e908d4142deac0d0
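For anyone wondering what "deactivating the leader election system" amounts to, here is a rough, hypothetical Go sketch (the flag and function names are made up, not taken from the commit above): when election is disabled, the provisioning loop runs unconditionally, so a slow API server can still delay provisioning but can no longer kill the container through the "leaderelection lost" path.

// Hypothetical sketch of what "deactivating leader election" amounts to; all
// flag and function names here are illustrative, not the real diff.
package main

import (
	"context"
	"flag"
	"time"
)

// provisionLoop stands in for the nfs-provisioner's controller run loop.
func provisionLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(10 * time.Second):
			// watch PVCs and provision/delete volumes here
		}
	}
}

func main() {
	leaderElect := flag.Bool("leader-elect", true, "enable leader election (assumed flag)")
	flag.Parse()

	ctx := context.Background()
	if *leaderElect {
		// With election enabled, provisionLoop is wrapped in something like
		// leaderelection.RunOrDie (see the earlier sketch), so losing the
		// lease during API server trouble ends the whole process.
		runWithLeaderElection(ctx, provisionLoop)
		return
	}
	// With election disabled, the loop runs unconditionally: a slow API server
	// can still delay provisioning, but can no longer restart the container.
	provisionLoop(ctx)
}

// runWithLeaderElection is only a stub so this sketch compiles on its own.
func runWithLeaderElection(ctx context.Context, run func(context.Context)) {
	run(ctx)
}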
Hey @albanm, thank you very much for the links. I'll test them on my clusters.
Hi @albanm, have you updated your fork with the new nfs-provisioner code and k8s compatibility?
No, I didn't. I suppose I should merge upstream once in a while. I haven't encountered any problems yet.
Hi @albanm, I just had this issue. I guess the problem still persists in the latest version?
Honestly, I didn't check. I still use my fork without asking questions, as it works well for me.
@albanm The official image was updated 4 months ago, while yours was updated a year ago, if I am not mistaken. Are there any other differences besides the leader election change? Thanks
No, the only meaningful commit of my fork is this one: https://github.com/koumoul-dev/external-storage/commit/aa1869b605c6944f271df351e908d4142deac0d0