Hi, I have a problem with a Kubernetes cluster created with kops.
I'm using this command to create the cluster on AWS:
kops create cluster \
--bastion="true" \
--master-count=3 \
--node-count=4 \
--master-zones eu-west-1a,eu-west-1b,eu-west-1c \
--zones eu-west-1a,eu-west-1b,eu-west-1c \
--node-size t2.medium \
--master-size t2.small \
--dns private \
--dns-zone MY_DNS_ZONE \
--vpc MY_VPC \
--topology private \
--networking kopeio-vxlan \
--target=terraform \
${NAME}
Everything works fine until I try to create a PVC backed by EBS.
My pods are stuck in the "Pending" state, and in kubectl get events I see the following message:
0/7 nodes are available: 3 node(s) had taints that the pod didn't tolerate, 4 node(s) had no available volume zone.
When I change the cluster from multiple AZs to a single AZ, everything works fine.
As I understand it, pods are not able to connect to PVs in a different AZ than the pods.
Can you help me with this case or point me to someone who can? I'm new to Kubernetes; to run my project I used docker-compose, kops and kompose (https://github.com/kubernetes/kompose).
kubectl version:
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:39:04Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:25:46Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
docker-compose version:
docker-compose version 1.23.1, build b02f1306
docker-py version: 3.5.0
CPython version: 3.6.7
OpenSSL version: OpenSSL 1.1.0f 25 May 2017
kops version:
Version 1.10.0 (git-8b52ea6d1)
kompose version:
1.17.0 (a74acad)
Here are examples of my project's database Kubernetes files, including the claims:
database deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    kompose.cmd: kompose convert
    kompose.version: 1.17.0 (a74acad)
  creationTimestamp: null
  labels:
    io.kompose.service: database
  name: database
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: database
    spec:
      containers:
      - image: mysql:5.7
        name: DATABASE_NAME
        ports:
        - containerPort: 3306
        resources: {}
        volumeMounts:
        - mountPath: /var/lib/mysql
          name: database-claim0
        - mountPath: /etc/mysql/conf.d
          name: database-claim1
      restartPolicy: Always
      volumes:
      - name: database-claim0
        persistentVolumeClaim:
          claimName: database-claim0
      - name: database-claim1
        persistentVolumeClaim:
          claimName: database-claim1
status: {}
database claim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  creationTimestamp: null
  labels:
    io.kompose.service: database-claim0
  name: database-claim0
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Mi
status: {}
Hello @wszychta
As I understand it, pods are not able to connect to PVs in a different AZ than the pods.
If you are specifically using AWS EBS volumes for the PVs, this is true, yes.
This is currently a hard limitation of AWS EBS volumes; you can read a bit about that in the AWS documentation: "An EBS volume and the instance to which it attaches must be in the same Availability Zone."
Kubernetes itself supports many other storage backends that could be used zone independently, but of course with different properties (like performance, pricing, cloud provider support, ...). For example there is AWS EFS that can be used in any AZ within an AWS region but with its own tradeoffs (e.g. https://github.com/kubernetes-incubator/external-storage/issues/1030).
AWS EBS is definitely a common option for Kubernetes PVs, and if you know the limitations you can build your application deployments around them.
One workaround, which you already mentioned, is to put the whole cluster into a single AZ, but that is usually insufficient, as you lose important high-availability properties.
The other option I would recommend for Kubernetes 1.10/1.11 is to control where your volumes are created and where your pods are scheduled:
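A minimal sketch of what I mean (the class name and zone are placeholders, nothing kops creates for you): a dedicated StorageClass that only provisions EBS volumes in one zone, which your PVCs then reference via storageClassName, combined with scheduling the pods into the same zone, e.g. via a nodeSelector on the zone label:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-eu-west-1a        # placeholder name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  # provision EBS volumes only in this AZ
  zone: eu-west-1a
With that in place, the volume and the pod end up in the same AZ by construction.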
You can also read a bit about this here: https://github.com/kubernetes/kubernetes/issues/34583
There are some bigger improvements coming with Kubernetes 1.13 (and 1.12 as a beta feature), which you can read about here: https://kubernetes.io/blog/2018/10/11/topology-aware-volume-provisioning-in-kubernetes/
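Roughly, the improvement there boils down to a StorageClass with volumeBindingMode: WaitForFirstConsumer, so the volume is only provisioned once the pod has been scheduled and therefore lands in the pod's zone. A minimal sketch (the class name is a placeholder, and it needs at least Kubernetes 1.12):
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-wait-for-consumer   # placeholder name
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
# delay provisioning until a pod using the PVC is scheduled,
# so the EBS volume is created in that pod's AZ
volumeBindingMode: WaitForFirstConsumer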
From the Kops side, I think there is little that can be done at the moment to change anything about this situation (except for working towards 1.13 😉).
You may be able to work around this by pinning deployments to one AZ for now (a sketch follows below); I am looking forward to not having to do that in the future. If you ensure that deployments happen in one AZ, things should be OK. Match on the label failure-domain.beta.kubernetes.io/zone=ap-northeast-1d and things will work.
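A minimal sketch of that pinning applied to the database Deployment from above (I used one of your eu-west-1 zones instead of the ap-northeast-1d from my own cluster; the value has to match the AZ where the EBS volume actually lives, and the second volume is omitted for brevity):
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: database
spec:
  replicas: 1
  template:
    metadata:
      labels:
        io.kompose.service: database
    spec:
      # schedule the pod only onto nodes in the AZ where the EBS volume lives
      nodeSelector:
        failure-domain.beta.kubernetes.io/zone: eu-west-1a
      containers:
      - image: mysql:5.7
        name: database
        volumeMounts:
        - mountPath: /var/lib/mysql
          name: database-claim0
      volumes:
      - name: database-claim0
        persistentVolumeClaim:
          claimName: database-claim0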
Thank you for your answers.
I think (for now) I'm going to use only one AZ and wait for kops to support Kubernetes 1.13 :)
The WaitForFirstConsumer binding mode mentioned in the 1.13 blog post works well when you first spawn your containers.
But if you ever need to upgrade the underlying servers, for example to switch to another instance type, and you roll them out with an instance refresh, the container running on them will be killed and may be rescheduled elsewhere, in an AZ where its PV does not exist. (At that point the PVC has not been destroyed, so no fresh volume is created, and we don't benefit from WaitForFirstConsumer here. It will just fail.)
Has anyone experienced this?
Strange, how does that make sense? The documentation says:
When persistent volumes are created, the PersistentVolumeLabel admission controller automatically adds zone labels to them. The scheduler (via the VolumeZonePredicate predicate) will then ensure that pods that claim a given volume are only placed into the same zone as that volume, as volumes cannot be attached across zones.
https://kubernetes.io/docs/setup/best-practices/multiple-zones/#functionality
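To illustrate, a dynamically provisioned EBS-backed PV should then end up with zone labels roughly like this (the PV name, volume ID and zone are made-up examples, just to show the shape):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0a1b2c3d-example          # generated name, hypothetical
  labels:
    # added by the PersistentVolumeLabel admission controller
    failure-domain.beta.kubernetes.io/region: eu-west-1
    failure-domain.beta.kubernetes.io/zone: eu-west-1a
spec:
  capacity:
    storage: 100Mi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  awsElasticBlockStore:
    volumeID: aws://eu-west-1a/vol-0123456789abcdef0   # hypothetical volume ID
    fsType: ext4
The scheduler should then only place pods that claim this volume onto nodes carrying the matching zone label.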
@guyelia Interesting. I don't know "why" it happens, but it sure happened to me.
I'll try to verify tomorrow if the "zone label" is actually added