When the default StorageClass is configured with volumeBindingMode set to WaitForFirstConsumer, workspaces do not start. My guess is that Che waits for the PVC to reach the "Bound" state before creating the workspace (or mkdir) pod, but with this volumeBindingMode that never happens: the PVC stays in the "Pending" state with the message "waiting for first consumer to be created before binding" until the workspace startup fails.
1) Create default StorageClass with WaitForFirstConsumer volumeBindingMode (more info here)
2) Try to start workspace
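For reference, a minimal default StorageClass for step 1 might look like the following (the name and provisioner are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard                                          # illustrative name
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # make it the default StorageClass
provisioner: kubernetes.io/no-provisioner                 # illustrative provisioner
volumeBindingMode: WaitForFirstConsumer                   # PVCs stay Pending until a pod consumes them
```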
Workspace startup fails with message:
Error: Failed to run the workspace: "Waiting for persistent volume claim 'claim-che-workspace' reached timeout"
And the error log in che-master:
2019-03-14 14:06:31,927[aceSharedPool-0] [ERROR] [o.e.c.a.w.s.WorkspaceRuntimes 813] - Waiting for persistent volume claim 'claim-che-workspace' reached timeout
org.eclipse.che.api.workspace.server.spi.InternalInfrastructureException: Waiting for persistent volume claim 'claim-che-workspace' reached timeout
at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesPersistentVolumeClaims.wait(KubernetesPersistentVolumeClaims.java:225)
at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.KubernetesPersistentVolumeClaims.waitBound(KubernetesPersistentVolumeClaims.java:165)
at org.eclipse.che.workspace.infrastructure.kubernetes.namespace.pvc.CommonPVCStrategy.prepare(CommonPVCStrategy.java:200)
at org.eclipse.che.workspace.infrastructure.kubernetes.KubernetesInternalRuntime.internalStart(KubernetesInternalRuntime.java:200)
at org.eclipse.che.api.workspace.server.spi.InternalRuntime.start(InternalRuntime.java:141)
at org.eclipse.che.api.workspace.server.WorkspaceRuntimes$StartRuntimeTask.run(WorkspaceRuntimes.java:779)
at org.eclipse.che.commons.lang.concurrent.CopyThreadLocalRunnable.run(CopyThreadLocalRunnable.java:38)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
OS and version:
This is reproducible with CodeReady Workspaces on OCP 4 Beta and with the latest Eclipse Che on pure Kubernetes (minikube).
@sleshchenko, @skabashnyuk: do you know how much time a fix could take?
@dmytro-ndp Like a couple of hours
A workaround is implemented in https://github.com/eclipse/che/pull/13409 that makes it possible to manually disable waiting for PVCs with a configuration property.
But a solution with good UX would be something like: Che Server should detect volumeBindingMode=WaitForFirstConsumer itself and not wait for PVCs to be bound in that case.
I'm not familiar with the WaitForFirstConsumer volume binding mode, and maybe Che Server has no permission to check it at all; in that case, before waiting for PVCs, Che Server could start a tooling pod that requests the PVCs, wait for the PVCs to be bound, and only then create all other workspace pods.
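A rough sketch of such a tooling pod, using the PVC name from the log above and a hypothetical pod name and image: the pod only has to mount the claim so that the scheduler places it and the PVC gets bound.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: workspace-pvc-binder            # hypothetical name for the tooling pod
spec:
  restartPolicy: Never
  containers:
  - name: binder
    image: busybox                       # any small image would do
    command: ["sleep", "5"]              # the pod only needs to exist long enough for binding
    volumeMounts:
    - name: workspace-data
      mountPath: /data
  volumes:
  - name: workspace-data
    persistentVolumeClaim:
      claimName: claim-che-workspace     # the PVC that is stuck in Pending
```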
BTW some investigation is needed to understand the final solution of this issue.
Now, there is a bit more info about this issue I would like to share:
Initially, we did not wait for PVCs to be bound after creating them, but then we faced an issue on a slow OpenShift cluster (see https://github.com/eclipse/che/issues/11848 for more).
So, we introduced waiting for PVCs; it costs almost nothing on fast OpenShift installations and helps slow ones.
Now, we have faced an issue with the WaitForFirstConsumer volume binding mode and waiting for PVCs to be bound.
I've checked and now know for sure that Che Server is not able to check the volumeBindingMode of the storage class that will be used for PVCs (default or configured) without the cluster-admin role, which is not available to it.
Possible solutions we can move with:
1) Keep false (do not wait) as the default value; it should work fine for all fast enough OpenShift installations (fast enough is an abstract term, maybe some investigation should be done). Document the FailedScheduling issue and provide a configuration that should be applied to Che on such slow K8s/OpenShift installations (enable waiting for PVCs to be bound). WaitForFirstConsumer then works fine without any additional Che configuration as long as the OpenShift installation is not slow.
2) Start a tooling pod that requests the PVCs first, so that waiting for PVCs to be bound handles waitForFirstConsumer by itself. It may not work for slow installations like the one Eugene had, since as he mentioned:
"This cannot be fixed by removing this event from unrecoverable events in Che conf, since the error happens on k8s level - Che is attempting to create a deployment with a pod spec that uses unbound pvc."
But I did not investigate this topic myself. The downside is some extra workspace startup time even when WaitForFirstConsumer is not configured; I assume that it's like a couple of seconds if an image is already pulled.
@l0rd @slemeur I would like to hear your opinion about it: do you think the first way is good enough, or should we invest time in the second solution before GA?
IMHO, let's have CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND=false by default with as clear an error message as possible, maybe with advice to set it to true, and we will check it carefully against OCP before the release of CRW 1.2.
I would go with CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND=true to have this behavior by default on Che installations (a.k.a. don't change anything "by default").
And on the CRW side (for the 1.2 release) I think we should implement logic in the operator that inspects the default StorageClass (the operator has cluster-admin rights, so it should be able to see whether WaitForFirstConsumer is configured) and, based on that, sets the CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND property to true/false. WDYT @davidfestal ?
For Che 7/CRW 2 it would be nice to have this implemented in Che Server, but I don't have any strong opinion on the proposed solution with the "artifical" pod.
I think that setting CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND=true as suggested by @rhopp is the way to go in the short term.
Using an init container looks like overkill. And using the operator to detect the volume binding mode doesn't look simple either: theoretically the wsmaster and the workspace pods can be bound to PVs with different volume binding modes (che.osio is an example we all know).
Something that I don't understand is why we can't just infer that we are in WaitForFirstConsumer mode at runtime if we intercept an event message that says waiting for first consumer to be created before binding.
Something that I don't understand is why we can't just infer that we are in WaitForFirstConsumer mode at runtime if we intercept an event message that says waiting for first consumer to be created before binding.
I missed this event. Checking an event message is not reliable, but I like this proposal and think it will improve Che Server behavior :+1:
Now, Che Server may be configured not to wait for PVCs to be bound with a configuration property, which unblocks Che on installations where the waitForFirstConsumer PV binding mode is configured.
But there is another issue[1] to improve the PVC waiting process and, in addition to checking the PVC status, listen to PVC-related events with the message "waiting for first consumer to be created before binding".
That should make reconfiguring Che Server unnecessary in the waitForFirstConsumer case.
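For reference, the PVC event such a listener would match looks roughly like this (abridged and reconstructed from the message quoted above; the fields other than the message are my assumption):

```yaml
apiVersion: v1
kind: Event
type: Normal
reason: WaitForFirstConsumer          # assumed reason reported by the volume controller
message: waiting for first consumer to be created before binding
involvedObject:
  kind: PersistentVolumeClaim
  name: claim-che-workspace           # the workspace PVC from the log above
```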
Where is CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND set? I cannot find it in the ConfigMap.
@bryantson It should not be needed after https://github.com/eclipse/che/pull/14239.
BTW the default value is in the che.properties file that is bundled with Che Server, and you can override it by providing a CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND value in the ConfigMap.
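A minimal sketch of that override, assuming the Che Server ConfigMap and namespace are both named che (they may differ per installation); the Che Server pod typically has to be restarted to pick up the change:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: che          # name of the Che Server ConfigMap; may differ per installation
  namespace: che     # namespace where Che Server is deployed; may differ
data:
  CHE_INFRA_KUBERNETES_PVC_WAIT__BOUND: "false"   # do not wait for PVCs to be bound
```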