We've had a number of problems with ephemeral storage on EC2, not least that newer instance types don't include them (e.g. kubernetes/kubernetes#23787). Also symlinking /mnt/ephemeral seems to confuse the garbage collector.
We should figure out how to ensure that we have a big enough root disk, maybe how to re-enable btrfs, and then if there is anything we can do with the instance storage if we're otherwise not going to use it (maybe hostVolumes? Or some sort of caching service?)
Are we doing aufs or btrfs?
For the docker instances we're doing aufs or overlay. We should revisit that as other approaches get more testing.
For using instance storage, we should use whatever is appropriate for whatever we decide to use it for :-)
Then what should we use?
Is this still on the roadmap?
Adding a +1 on the need for exposing instance storage - the new AWS i3 instances bench at 18GB/sec+ on instance storage (NVMe-based), which is substantially higher than EBS.
We would also like to expose the instance storage, likewise for i3 instances. I'm not sure I agree that AWS is moving away from instance storage -- they are just moving it to a new style of instance.
Not only do the i3s have amazing iops performance, the d2 instance class has the most cost-efficient storage available on AWS... 6TB for $150 a month is almost as cheap as s3.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
+1 for supporting instance storage. The i3's performance is great.
/lifecycle frozen
/remove-lifecycle stale
I realise it's probably implicit, but for reliability, having the kubelet working space (logs, tmp) on a different volume from the pod storage and the container writable layers is really important. Whatever is done here, please do preserve that (at least as an option).
Would _really_ like to see this incorporated into kops sooner rather than later, especially now that local PVs are beta in k8s 1.10.
Are there any workarounds for using the instance storage for pods that would benefit from the extra speed of the storage-optimized instances?
Would love to see this issue get some love. The NVMe instance storage on i3 instances is so fast and useful. I think that many of us would see a big jump in the utility of our instances if this was available for emptyDir and Docker image storage.
I'm interested in helping out if someone could get me pointed in the right direction. I'm pretty new to the kops codebase.
@justinsb asked on Slack for a use case for this issue, so here's mine:
We are doing CI on Kubernetes, running our software builds in pods that leverage emptyDir scratch directories for code fetches and compiles. It's very I/O intensive, so we chose i3.large instances. Unfortunately, without access to the NVMe disk, these builds are slow as molasses. Without NVMe access, there's no reason to use i3 instances with kops/Kubernetes.
We really need these volumes and I'm willing to take a stab at implementing this but I need someone to point me in the right direction because I'm not very familiar with the kops codebase.
Thanks.
Another use case:
We have a Kafka cluster running on kubernetes. Kafka takes care of data replication. We stream large amounts of data onto this kafka cluster. The bottleneck is disk bandwidth.
--> We want i3 instances with NVMe to maximize our performance.
Our use case is similar to Hermain's in that we are running a pod-based Cassandra cluster and also want to maximize disk performance by using the locally attached storage rather than ebs volumes.
So, would /var/lib/kubelet/pods be the place to mount this drive? or /var/lib, so that it hosts both /var/lib/docker and /var/lib/kubelet/pods (where emptyDir volumes live).
My use case is just removing EBS from my cluster -- I'd like to just use instance store as the root filesystem since there is no need for EBS, nothing useful is persisted only there.
Excited bandwagon joiner here: I'm running Spark on Kubernetes and am trying to remove my swapfile read/write as a bottleneck with the new r5d instances. I see that support for the instance type was added to the master branch in a fairly cursory way, but unless I'm missing something it doesn't seem to affect anything other than performing the block mapping. I'd really love to see these disks get mounted in a way that my pods will automatically use them as scratch disk, and ideally I'd also like to get those disks set up in RAID 0.
@chrissnell I think /var/lib/kubelet/pods would do what we need, since any application running in Kubernetes can take advantage of it. What reasons are there to mount /var/lib instead? I understand @scopej, who doesn't want any EBS at all. But if you have an EBS volume, you might as well use it for everything except /var/lib/kubelet/pods, right?
Joining the chorus here.
Running stateless apps on a Kubernetes cluster. We have no need to store state in said apps. For the ones where we do need it (Grafana, Prometheus, etc.), we're using StatefulSets to do the trick. That pretty much renders our need for EBS close to zero.
Our cluster today runs on Container Linux by CoreOS 1800.6.0 (Rhyolite), but we would gladly change it to Debian stretch if support comes for it first.
In addition to the kubelet reliability issue https://github.com/kubernetes/kops/issues/429#issuecomment-378774435 which accounts for maybe 10% of our node failures, there is a second use case that hasn't been presented thus far, which is a variation on the local-storage-fast use case.
And that's running glusterfs/ceph on the kubelet nodes.
We're considering running such a thing, and I'm speculating as we haven't got into deep design yet, but something like:
- EBS for the data volume on a given storage node
- local NVMe for a write-through cache and possibly a hot object read cache (in the event that local NVMe storage size exceeds main memory)
Why would we do this? Why does it make sense given EBS's brilliant performance?
Availability: EBS volumes in EC2 fail - we've had multiple outages for stateful singleton components due to EC2 hypervisor failures, just in the last year. (When a hypervisor fails, the EBS volumes on it cannot be remounted elsewhere for some time - we've seen 45 minutes) - my understanding is that EBS needs to fence the hypervisor to be sure no writes will be submitted from the hypervisor's EBS driver back to the EBS volumes assigned to it, and until that fencing is confirmed, the EBS volume cannot be reattached elsewhere, even with the force feature in the EBS API.
As such, being able to tolerate such failures means either retooling these singletons (which is sometimes a very big job :)) or having a storage driver that doesn't require the hypervisor to be fenced.
AWS NFS is also an option there of course, and yes, we're reviewing it :)
Hoping that @justinsb can chime in here and point me and the others in the right direction. This issue is super critical for us and is blocking a big project; I want to give it a shot but I'm not sure where to start in the codebase.
So, just jumping back in here with what I talked about in office hours earlier. At least for the r5d.4xlarge instances, adding this to the top of my user_data file works:
```
sudo apt-get -y install mdadm
sudo mdadm --create --verbose /dev/md0 --level=0 --name=empty_dir --raid-devices=2 /dev/nvme1n1 /dev/nvme2n1
sudo mkfs.ext4 -L empty_dir /dev/md0
sudo mkdir /var/lib/docker/overlay
sudo mount LABEL=empty_dir /var/lib/docker/overlay
```
A few notes here: you need to map the ephemeral volumes correctly for your instance to pick them up. In Terraform, that looks like this:
```
resource "aws_launch_configuration" "default-spark-cluster" {
  name_prefix                 = "default.spark.cluster-"
  image_id                    = "ami-050a5ee88521c50e4"
  instance_type               = "r5d.4xlarge"
  key_name                    = "${aws_key_pair.kubernetes-spark-cluster-4e533df7fa5cd8b4500a7bb98d719b11.id}"
  iam_instance_profile        = "${aws_iam_instance_profile.nodes-spark-cluster.id}"
  security_groups             = ["${aws_security_group.nodes-spark-cluster.id}"]
  associate_public_ip_address = false
  user_data                   = "${file("${path.module}/data/aws_launch_configuration_default.spark.cluster_user_data")}"

  ephemeral_block_device {
    virtual_name = "ephemeral0"
    device_name  = "/dev/sdb"
  }

  ephemeral_block_device {
    virtual_name = "ephemeral1"
    device_name  = "/dev/sdc"
  }

  ephemeral_block_device {
    virtual_name = "ephemeral2"
    device_name  = "/dev/sdd"
  }

  lifecycle = {
    create_before_destroy = true
  }

  enable_monitoring = false
}
```
Where the first ephemeral block device is the 8 GB root device and the second two are ~280 GB drives, all NVMe. This is working with the latest k8s-1.8 stretch image, k8s-1.8-debian-stretch-amd64-hvm-ebs-2018-08-17 (ami-050a5ee88521c50e4), because in stretch the default debconf frontend is some form of non-interactive that accepts -y.
This actually isn't even ideal, though. I'd love for the large block devices to be mounted as the root volume, but it's unclear to me how to do that; I'm not sure if it would have to be baked into a custom AMI. Additionally, I couldn't mount it any lower than /var/lib/docker/overlay (I would've been happier with /var), but I wasn't able to work through the issues of blowing up necessary files and directories on mounting (I tried moving them before the mount and moving them back afterwards, but couldn't get that to work consistently either; I may have run into a race condition).
Additionally, this obviously would need to be abstracted to support machines with different numbers of ephemeral disks, and a more general solution to catch all the desired storage would be good (I figured out that my Spark jobs weren't using /var/lib/kubelet/pods as storage; everything was in /var/lib/docker/overlay).
cc @justinsb
I'm still pretty new to kops development so I'm hoping that someone can set me straight here.
It appears that the proper place to do the management of the devices is in nodeup, specifically the AWS-specific code here: https://github.com/kubernetes/kops/tree/master/upup/pkg/fi/cloudup/awsup
The instance types and their ephemeral storage (if any) are defined here: https://github.com/kubernetes/kops/blob/master/upup/pkg/fi/cloudup/awsup/machine_types.go
It feels like nodeup should detect the presence of ephemeral disks and issue the mkfs.ext4 commands on the volumes. I'm not so sure how they would be mounted. Would this be defined in the kops InstanceGroup resource spec? Perhaps we could have something like this:
```
apiVersion: kops/v1alpha2
kind: InstanceGroup
[...]
spec:
  machineType: m5d.4xlarge
  ephemeralDisks:
  - name: ephemeral0
    mountPoint: /var/lib
  - name: ephemeral1
    mountPoint: /scratch
```
Other thoughts....
The use of these ephemeral disks for system directories like /var/lib is super tricky and the implementation varies from system to system, usually involving use of chroot. It's my opinion that this should be out of scope for the initial implementation of ephemeral functionality. Using /var/lib/kubelet/pods or /var/lib/docker, however, should be supported. For /var/lib/kubelet, the mount would have to happen before kubelet is started. For /var/lib/docker, we should probably be stopping docker before the mount and starting it again afterwards, especially for systems that enable docker by default (CoreOS).
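To make that ordering concrete, here's a minimal sketch of what nodeup (or a hook) would effectively need to run; the device name, label, and single-disk layout are assumptions for illustration, not what nodeup actually does today:
```
# Illustrative ordering only; /dev/nvme1n1 and the label are assumed, not detected.
systemctl stop docker                    # nothing may be writing to /var/lib/docker during the switch
mkfs.ext4 -L ephemeral0 /dev/nvme1n1     # format the ephemeral disk (destroys any existing contents)
mkdir -p /var/lib/docker
mount LABEL=ephemeral0 /var/lib/docker   # the mount must be in place before docker starts again
systemctl start docker
# For /var/lib/kubelet the equivalent mount simply has to happen before kubelet.service starts.
```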
I think that software RAID like @thejosephstevens is doing should be out of scope for the initial implementation. I think that having this capability is a great idea but significantly complicates the first implementation.
I'm also wondering how to set up Ceph in my kops-managed k8s cluster. It seems like these i3 instances would be great, but I wouldn't want kops to format them for me, since Ceph would want to take over the device itself.
Also, I've seen that MongoDB recommends XFS as the underlying filesystem rather than ext4.
So, I think it would probably be something that should be configurable. Ideally, instead of having any default/automatic behavior here, we'd add a configuration section to the instance group that specifies what to do with extra volumes, e.g. whether or not to format them, what filesystem to use if so, and what path to mount them at, if any. On startup the instance would examine this configuration and format/mount the disks as specified.
At this point, though, perhaps instead of actual new configuration options, a simpler solution might just be to add some examples to the docs showing how to use hooks to set up these volumes to your taste; you just have to run a few commands to format and mount the disks at the location of your choice, right?
I think volume setup shouldn't be a part of kops unless bringing the node into the cluster actually requires the setup part. If you want to use ephemeral storage for the docker directory then that should be part of kops. But for using the ephemeral storage as a ceph node, it should not be in kops. For applications like ceph or mongo, you should probably just run a daemonset which mounts a hostPath and formats it directly, then exposes it. It's a more generic and higher level way to configure your hosts.
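For the ceph/mongo style of consumer, a rough sketch of that DaemonSet pattern might look like the following; the device path, image, and "format only if empty" behaviour are assumptions of mine, not something kops would provide:
```
# Hypothetical sketch: a privileged DaemonSet that claims the raw ephemeral device on each node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ephemeral-disk-prepare
spec:
  selector:
    matchLabels:
      app: ephemeral-disk-prepare
  template:
    metadata:
      labels:
        app: ephemeral-disk-prepare
    spec:
      containers:
      - name: prepare
        image: alpine:3.9                # placeholder image
        securityContext:
          privileged: true               # required to format the raw block device
        command:
        - sh
        - -c
        # Format only if no filesystem is detected, then stay up so the DaemonSet keeps running.
        - "apk add --no-cache e2fsprogs && (blkid /dev/nvme1n1 || mkfs.ext4 /dev/nvme1n1) && tail -f /dev/null"
        volumeMounts:
        - name: dev
          mountPath: /dev                # expose the host's device nodes to the container
      volumes:
      - name: dev
        hostPath:
          path: /dev
```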
As a workaround, we used additionalUserData in the IG spec to instruct cloud-init to place the ephemeral node storage of a c3.large instance on a given path, as in this example:
```
spec:
  additionalUserData:
  - name: local-storage.txt
    type: text/cloud-config
    content: |
      #cloud-config
      mounts:
      - [ xvdc, /var/local/mnt/lv00, "auto", "defaults,nofail", "0", "0" ]
      - [ xvdd, /var/local/mnt/lv01, "auto", "defaults,nofail", "0", "0" ]
```
Then we used the local storage provisioner (https://github.com/kubernetes-incubator/external-storage/tree/master/local-volume), specifying the parent /var/local/mnt path as the discovery directory, making the ephemeral storage available to pods:
```
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
local-pv-1526796f   14Gi       RWO            Delete           Available           local-storage            1m
local-pv-30334a4a   14Gi       RWO            Delete           Available           local-storage            1m
local-pv-38820a0b   14Gi       RWO            Delete           Available           local-storage            1m
local-pv-73d1578a   14Gi       RWO            Delete           Available           local-storage            1m
...
```
Although the Node still gets an EBS volume as its root from kops, at least the fast local storage can be used for I/O intensive workloads, satisfying our use-case.
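For reference, the consuming side of that setup is just a no-provisioner StorageClass plus ordinary PVCs. A minimal sketch, assuming the local-storage class name shown above (the claim name and size are illustrative):
```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # local PVs are pre-created by the provisioner daemon, not dynamically
volumeBindingMode: WaitForFirstConsumer     # delay binding until the pod is scheduled onto a node
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: scratch-claim                       # illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  resources:
    requests:
      storage: 14Gi                         # matches the PVs listed above
```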
Just a note on the complications around software RAID that I ran into: it was fairly trivial on the newest k8s stretch AMIs. I did run into issues on the jessie images because of a debconf setting calling for UI interaction in post-install hooks, which meant mdadm splashed a blue config screen the first time I tried this manually. I was actually unable to successfully change this configuration in bootstrap prior to installing mdadm (although I'm sure I was missing something). Outside of unusual install hooks, though, if your node is fully ephemeral and the storage is fully ephemeral, you don't need to consider your reboot configuration settings, which was the only other option there seemed to be. There's obviously testing to be done on how this would operate consistently across various OSes (I didn't need to solve for Ubuntu, CoreOS, Amazon Linux, etc.), but the process itself was pretty trivial. I haven't run into any cases where the RAIDing process has failed. I have been using this for about a month and a half now and it really just works.
@thejosephstevens did you try setting DEBIAN_FRONTEND=noninteractive in your test?
I've also had to set UCF_FORCE_CONFFNEW=YES, and at least once had to set a zillion apt-get flags to say really, really, NO REALLY, use non-interactive setup. These were on an Ubuntu base, so YMMV.
```
apt-get --no-install-recommends --fix-broken --fix-missing --assume-yes --auto-remove --quiet -o DPkg::options::="--force-confdef" -o DPkg::options::="--force-confnew" install ...
```
Yeah, tried that to no success. It ended up being a non-issue though once I moved to the most recent kops-1.10 stretch image (although I normally wouldn't advocate changing AMIs just to get different OS default settings). Caveat to my earlier posts though, software raid in one of my environments started freaking out (I ran into the md127 bug), so I ended up de-RAIDing my worker nodes. Without drilling further into that bug (not a current priority for me), I can't recommend my RAID setup from above. The non-RAIDed local drives are still working great though, and I'd be perfectly happy if kops built support for a mapping of local drives to directory paths and a file-system choice (or just default ext4). I think the main trick there is navigating the bootstrap priority so you don't get any races and blow out system data anywhere.
The suggestions in this post worked for me (the md127 bug): create an array entry in /etc/mdadm/mdadm.conf and run update-initramfs -u.
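In shell form, that fix is roughly the following (assuming the array is /dev/md0, as in the earlier snippets):
```
# Persist the array definition so it keeps its name across reboots instead of coming back as /dev/md127.
mdadm --detail --brief /dev/md0 >> /etc/mdadm/mdadm.conf
update-initramfs -u
```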
This is what I'm using; not sure it's the most elegant way, but it's working:
```
spec:
  additionalUserData:
  - content: |
      #cloud-config
      repo_update: true
      packages:
      - mdadm
      runcmd:
      - sudo mdadm --create --verbose /dev/md0 --level=0 --name=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
      - sudo mkfs.ext4 -L 0 /dev/md0
      - sudo mkdir /data-1
      - sudo mount LABEL=0 /data-1
      - [ sudo, sh, -c, 'mdadm -Db /dev/md0 >> /etc/mdadm/mdadm.conf' ]
      - echo "ARRAY /dev/md0 $(grep -oE 'UUID=[0-9a-z]+:[0-9a-z]+:[0-9a-z]+:[0-9a-z]+' /etc/mdadm/mdadm.conf)" > /tmp/uuid
      - [ sudo, sh, -c, "echo $(cat /tmp/uuid) >> /etc/mdadm/mdadm.conf" ]
      - [ sudo, sh, -c, "sed '/name/d' /etc/mdadm/mdadm.conf > /tmp/uuid" ]
      - [ sudo, sh, -c, "cat /tmp/uuid | tee /etc/mdadm/mdadm.conf > /dev/null" ]
      - sudo update-initramfs -u
    name: local-storage.txt
    type: text/cloud-config
```
I used ideas from this thread to get it working. This is for a single-volume NVMe drive as found on an AWS EC2 m5d.xlarge instance:
```
apiVersion: kops/v1alpha2
kind: InstanceGroup
spec:
  additionalUserData:
  - name: 00-prep-local-storage.sh
    type: text/x-shellscript
    content: |
      #!/bin/sh
      /sbin/mkfs.ext4 /dev/nvme1n1
  - name: 02-mount-disks.sh
    type: text/x-shellscript
    content: |
      #!/bin/sh
      mkdir /scratch
      /bin/mount /dev/nvme1n1 /scratch
      mkdir /scratch/pods
      mkdir /scratch/docker
      mkdir /var/lib/kubelet
      systemctl stop docker
      rm -rf /var/lib/docker
      ln -s /scratch/pods /var/lib/kubelet/
      ln -s /scratch/docker /var/lib/
      systemctl start docker
```
The downside of this approach is that mkfs(8) is slow and adds a considerable amount of time to instance launch, at least 3-4 minutes.
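If the time is going into the full-device discard pass that mkfs.ext4 runs by default on SSD/NVMe devices, skipping it might help; this is an untested guess on my part, not something verified on these instances:
```
# -E nodiscard skips the initial TRIM/discard of the whole device, which can dominate format time on large NVMe drives.
/sbin/mkfs.ext4 -E nodiscard /dev/nvme1n1
```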
FWIW, we've moved to a systemd-based solution now to avoid messing with docker and kube's storage after they start running, just got it running today.
```
hooks:
- name: volume-mount
  roles:
  - Node
  before:
  - kubelet.service
  - docker.service
  - logrotate.timer
  - docker-healthcheck.timer
  - kubernetes-iptables-setup.service
  - docker-healthcheck.service
  - logrotate.service
  manifest: |
    User=root
    Type=oneshot
    ExecStartPre=/bin/bash -c 'mkfs.ext4 -L docker_dir /dev/nvme1n1'
    ExecStartPre=/bin/bash -c 'mkdir -p /var/lib/docker/overlay'
    ExecStartPre=/bin/bash -c 'mount LABEL=docker_dir /var/lib/docker/overlay'
    ExecStartPre=/bin/bash -c 'mkfs.ext4 -L empty_dir /dev/nvme2n1'
    ExecStartPre=/bin/bash -c 'mkdir -p /var/lib/kubelet/pods'
    ExecStart=/bin/bash -c 'mount LABEL=empty_dir /var/lib/kubelet/pods'
```
There's absolutely more work to be done on this. I'd like better conditionality so it could just be applied to all nodes without resulting in a failing systemd unit on differently configured machines, but I think there may be a model here to extend that doesn't require as much finagling around processes that depend on the potential mount points. I'm pretty sure this doesn't handle system restart, though, so I wouldn't buy it wholesale.
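One possible way to get that conditionality without anything kops-specific would be to collapse the commands into a single guarded script, so the unit exits cleanly on nodes that don't have the devices. A sketch that keeps the name/roles/before stanza from the hook above and only changes the manifest, reusing the same device names and labels (not verified on every image):
```
  manifest: |
    User=root
    Type=oneshot
    # Exit 0 (rather than fail) when the ephemeral devices are absent, so the same hook can ship to every node.
    ExecStart=/bin/bash -c 'set -e; \
      [ -b /dev/nvme1n1 ] && [ -b /dev/nvme2n1 ] || exit 0; \
      mkfs.ext4 -L docker_dir /dev/nvme1n1; \
      mkdir -p /var/lib/docker/overlay; \
      mount LABEL=docker_dir /var/lib/docker/overlay; \
      mkfs.ext4 -L empty_dir /dev/nvme2n1; \
      mkdir -p /var/lib/kubelet/pods; \
      mount LABEL=empty_dir /var/lib/kubelet/pods'
```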
I am using a user data solution to mount /var/lib/docker on a c5d ephemeral volume, like what is posted above (thank you @chrissnell). I skipped mounting /var/lib/kubelet/pods because kubelet cannot delete the container directories (cannot delete directory /var/lib/kubelet/pods/ed35ba56-1595-11e9-a73a-0210bcc711f2: it is a mount point). Docker is running fine on the ephemeral volume and the node is in service; however, persistent volume claims are not mounting:
```
Warning FailedMount 1m (x4 over 8m) kubelet, ip-172-24-105-243.us-west-2.compute.internal Unable to mount volumes for pod "volume-writer-7758ddcbcf-2554f_pvc-test(ba9f0953-1836-11e9-aeff-061c5ed62f3e)": timeout expired waiting for volumes to attach or mount for pod "pvc-test"/"volume-writer-7758ddcbcf-2554f". list of unmounted volumes=[mypvc]. list of unattached volumes=[mypvc default-token-7nhh6]
```
PVCs work on c5-type instances. I thought it was because of the device name mismatch between the AWS API and the Linux instance:
```
# /var/log/daemon.log
Jan 14 21:09:35 ip-172-24-105-243 kubelet[1869]: I0114 21:09:35.719693 1869 operation_generator.go:486] MountVolume.WaitForAttach entering for volume "pvc-ba8767a4-1836-11e9-8700-028bb95a9f1c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-west-2c/vol-00744073e67ea96fd") pod "volume-writer-7758ddcbcf-2554f" (UID: "ba9f0953-1836-11e9-aeff-061c5ed62f3e") DevicePath "/dev/xvdbc"

root@ip-172-24-105-243:/home/admin# lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
nvme0n1     259:1    0   128G  0 disk
└─nvme0n1p1 259:2    0     8G  0 part /
nvme1n1     259:0    0 186.3G  0 disk /scratch
nvme2n1     259:3    0     1G  0 disk
```
But devices have the nvme names on c5s too, and PVCs work there. Not sure what's going on with that.
This is k8s version 1.10.7
Update: PVCs are working on c5d's in another cluster where I'm running a newer kops Debian AMI.
PVCs work on c5d with AMI kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17,
but not with kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-08-17.
It seems mkfs.ext4 hangs on the default image for kops 1.12.1. It might be related to https://github.com/bargees/barge-os/issues/76, but I'm not quite sure.
Is there any plan to address this?
Is there a good step-by-step guide available for using NVMe as PVCs in pods?
I've read the discussion 5 times and I'm still unsure what to do.
If you're talking about the ephemeral on-host disks like in AWS, I wouldn't recommend them for anything other than scratch disk. The way I did it in my example above was to mount the disks at the paths in the OS that Docker uses for basic container storage (/var/lib/docker/overlay) and for emptyDir volumes (/var/lib/kubelet/pods). Once you have storage mapped at those locations, you can access it by adding this to your deploy template:
```
volumeMounts:
- name: scratch
  mountPath: /tmp
volumes:
- name: scratch
  emptyDir: {}
```
Just be aware that all the contents of these disks will be lost if you lose the machine, so don't use them for anything you want to persist (Prometheus metrics, logs, whatever).
Given my experiences with managing these disks in AWS, it's not clear to me that it was at all worth the effort. We spent a good amount of time debugging issues at runtime (see my mention of md127 further up) and just debugging the basic bootstrap process and I'm not sure that we got a meaningful performance bump on read/write speeds (our tests of disk perf seemed to indicate that these disks did not perform up to the spec of NVMe drives we were familiar with).
Hi,
we are using the NVMe drive provided by AWS with some instances. For now I use the following kops hook to mount the NVMe and to assign pods and containers onto it:
```
hooks:
- name: nvme
  roles:
  - Node
  before:
  - kubelet.service
  - docker.service
  - logrotate.timer
  - docker-healthcheck.timer
  - kubernetes-iptables-setup.service
  - docker-healthcheck.service
  - logrotate.service
  manifest: |
    User=root
    Type=oneshot
    ExecStartPre=/bin/bash -c 'mkfs -t xfs /dev/nvme1n1'
    ExecStartPre=/bin/bash -c 'mkdir /scratch'
    ExecStartPre=/bin/bash -c 'mount /dev/nvme1n1 /scratch'
    ExecStartPre=/bin/bash -c 'mkdir /scratch/pods'
    ExecStartPre=/bin/bash -c 'mkdir /scratch/docker'
    ExecStartPre=/bin/bash -c 'rm -rf /var/lib/docker'
    ExecStartPre=/bin/bash -c 'ln -s /scratch/pods /var/lib/kubelet/'
    ExecStart=/bin/bash -c 'ln -s /scratch/docker /var/lib/'
```
This does work, and we saw an improvement in our performance thanks to this local NVMe.
Anyway, we are now facing another issue regarding disk pressure.
Indeed, kubelet looks for disk space on its own filesystem (see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/), so it looks for space in / and not in /scratch (the NVMe mount point).
Basically, if the / partition goes over 90% disk usage, then all pods are evicted from the node even if the NVMe partition still has a lot of space.
Does anyone know if it's possible to fully move kubelet and Docker onto the NVMe, to avoid kubelet polling disk space from /?
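One approach that should address this, though I haven't verified it end to end: mount the NVMe filesystem directly at the directories kubelet and Docker use as their roots, rather than symlinking subdirectories into /scratch, since the eviction manager's nodefs/imagefs checks look at the filesystem backing /var/lib/kubelet and the Docker data root. A rough sketch, with both daemons stopped first and the same device as in the hook above:
```
# Sketch: back kubelet's nodefs and docker's imagefs with the NVMe device itself.
systemctl stop kubelet docker
mkfs -t xfs /dev/nvme1n1
mkdir -p /scratch
mount /dev/nvme1n1 /scratch
mkdir -p /scratch/kubelet /scratch/docker /var/lib/kubelet /var/lib/docker
mount --bind /scratch/kubelet /var/lib/kubelet   # real mounts, not symlinks, so statfs() reports the NVMe filesystem
mount --bind /scratch/docker /var/lib/docker
systemctl start docker kubelet
```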
I have some containers that need fast temporary storage (around 100 GB). We were using gp2-type AWS EBS volumes; however, they would quickly run out of burst balance. Local instance storage seemed like the perfect replacement, as it would reduce the spend on slow EBS volumes and provide fast temporary storage. However, I quickly found that Kubernetes doesn't seem to have quite implemented a way to use the local instance storage yet.
I wanted to use emptyDir volumes on my containers that needed fast local temporary storage so I tried moving /var/lib/kubelet to local instance storage by specifying it in the instance group configuration:
```
volumeMounts:
- device: /dev/nvme1n1
  filesystem: ext4
  path: /var/lib/kubelet
```
However, like previous posters have mentioned, I started seeing issues with disk pressure and pods being evicted even though the local instance storage had only used 35% of its capacity.
Instead, we have now switched to using a hostPath volume with an initContainer to set the correct permissions on the host directory. Our kops instance group now looks like this:
```
volumeMounts:
- device: /dev/nvme1n1
  filesystem: ext4
  path: /mnt/localssd
```
Relevant container configuration:
```
initContainers:
- name: fix-tmp-perms
  image: busybox
  securityContext:
    runAsUser: 0
  command: ["sh", "-c", "chown -R 201:201 /tmp/worker-temp; chmod 1777 /tmp/worker-temp; rm -rf /tmp/worker-temp/*"]
  volumeMounts:
  - name: worker-temp
    mountPath: /tmp/worker-temp
volumes:
- name: worker-temp
  hostPath:
    path: /mnt/localssd/worker-temp
    type: DirectoryOrCreate
```
What would be nice is to be able to specify in kops that the root volume should use the local instance storage rather than having to be backed by EBS. I think this makes sense, as the EBS volume is only used for temporary storage and is deleted when the instance is deleted.
> However, like previous posters have mentioned, I started seeing issues with disk pressure and pods being evicted even though the local instance storage had only used 35% of its capacity.
@kxesd most likely you will have to wait for Kubernetes & kops 1.19. The root cause is a bug in cAdvisor that was fixed only recently; it made Kubernetes incorrectly detect the imagefs partition and, with it, report usage of the wrong partition.
For more info, check https://github.com/google/cadvisor/pull/2586.