Charts: [stable/nfs-server-provisioner] container restarts when api server is failing

Created on 19 Jun 2019 · 14 comments · Source: helm/charts

Describe the bug

The container restarts frequently when the API server is not responding, but it should not have to.

Version of Helm and Kubernetes:

helm 2.14.1
kubernetes 1.12.7-1

Which chart:

latest stable/nfs-server-provisioner

What happened:

I have a cluster with latency and occasional failures of the API server. I have submitted a bug to my cloud provider, which manages the control plane.
My issue with nfs-server-provisioner is that during periods of API server failure I see logs like this, and the container restarts:

I0619 07:05:11.965197       1 leaderelection.go:231] failed to renew lease nfs/cluster.local-nfs-data-nfs-server-provisioner: failed to tryAcquireOrRenew context deadline exceeded
F0619 07:05:11.966172       1 controller.go:646] leaderelection lost

I have lots of problems with the PVCs based on this provisioner: binding timeouts, lost file writes, etc.
I am not certain, but I think it is related to these frequent restarts of the NFS server.

What you expected to happen:

The NFS server should tolerate API server failures.

How to reproduce it (as minimally and precisely as possible):

I do not manage the control plane of my cluster, but I suppose that manually stopping the API server should reproduce the issue.

Anything else we need to know:

I understand that restarting the container in this case is probably the expected behavior, but I think it is wrong:

  • replication is not relevant in the case of nfs-server-provisioner, and I don't see the point of a leader election
  • stability issues of the API server should not propagate to a storage provisioner, and from there to a bunch of volumes used by a bunch of services
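For context on why the container exits: client-go style leader election keeps renewing a lease against the API server, and if renewal keeps failing past a renew deadline, the process deliberately terminates (the "leaderelection lost" fatal log above), which makes kubelet restart the container. A rough illustrative sketch of that logic in Python (the function and parameter names here are hypothetical, not the actual client-go API; the defaults are only order-of-magnitude):

```python
import time

def run_leader_loop(try_renew, renew_deadline=10.0, retry_period=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Illustrative sketch only: keep renewing the lease, and give up once
    renewal has kept failing for longer than renew_deadline. This mirrors the
    'failed to tryAcquireOrRenew context deadline exceeded' log followed by
    the fatal 'leaderelection lost' exit that restarts the container."""
    deadline = clock() + renew_deadline
    while clock() < deadline:
        if try_renew():
            # Successful renewal against the API server pushes the deadline out.
            deadline = clock() + renew_deadline
        else:
            # Transient failure: keep retrying until the deadline expires.
            sleep(retry_period)
    # Renewal failed for the whole deadline window: give up leadership
    # and exit, so the container restarts.
    raise SystemExit("leaderelection lost")
```

This is why a flaky API server translates directly into container restarts: the only "recovery" path after the renew deadline is to exit and start over.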

All 14 comments

@albanm I'm seeing a very similar issue - which cloud provider are you seeing the latency with?

It is OVH. Their managed Kubernetes solution is quite recent.

I have published a temporary fork of nfs-provisioner on Docker Hub that deactivates the leader election system in the meantime. If you wish, you can test it with these values for the Helm chart:

image:
  repository: koumoul/nfs-provisioner
  tag: v1.0.0

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

This issue is being automatically closed due to inactivity.

Hey, I have seen the issue described above in multiple clusters, most recently on IBM and AWS. On both, the pod restarts, and these are the last logs you can read:

I0810 18:47:10.541004       1 leaderelection.go:231] failed to renew lease default/worldsibu.com-nfs: failed to tryAcquireOrRenew context deadline exceeded
F0810 18:47:10.541039       1 controller.go:646] leaderelection lost

@albanm do you happen to have the changes performed on a git repo?

The image is here: https://hub.docker.com/r/koumoul/nfs-provisioner/tags

The fork is here: https://github.com/koumoul-dev/external-storage

There are 5 commits by me, but 4 of them only concern the build. The fix is this commit: https://github.com/koumoul-dev/external-storage/commit/aa1869b605c6944f271df351e908d4142deac0d0

Hey @albanm, thank you very much for the links. I'll test them on my clusters.

Hi @albanm, have you updated your fork with the new nfs-provisioner code and Kubernetes compatibility?

No, I didn't. I suppose I should do a merge once in a while; I haven't encountered problems yet.

Hi @albanm, I just had this issue. I guess the problem still persists in the latest version?

Honestly, I didn't check. I still use my fork without asking questions, as it works well for me.

@albanm The official image was updated 4 months ago, while yours was updated a year ago, if I am not mistaken. Are there any other differences besides the leader election change? Thanks
