Hello,
kops version
Version 1.7.0 (git-e04c29d)
I attempted to create a custom ami to use with kops that has some pre-pulled docker images but ran into some issues.
I took the following steps:
Started an instance from the kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28 ami with a 20Gb ebs volume attached, pre-pulled my docker images, and baked a new ami from it. Then I created an instance group that uses the new ami:

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2017-08-28T14:54:51Z
  labels:
    kops.k8s.io/cluster: scpdcluster.k8s.local
  name: nodes-t2-small-frontend
spec:
  image: REDACTED/k8s-1.7-debian-jessie-amd64-hvm-ebs-cst-01 # my new ami
  machineType: t2.small
  maxSize: 2
  minSize: 1
  nodeLabels:
    dedicated: frontend
  role: Node
  rootVolumeSize: 20
  rootVolumeType: gp2
  subnets:
  - us-east-1a
Then I ran:

kops update cluster --yes
kops rolling-update cluster --yes

The nodes start in the cluster using my ami with the correct root volume size and type. However, when I ssh to a node and run sudo docker images, my images are no longer present; I only see the usual kubernetes images:
REPOSITORY TAG IMAGE ID CREATED SIZE
quay.io/external_storage/efs-provisioner latest b4da30527798 13 days ago 49.53 MB
gcr.io/google_containers/cluster-autoscaler v0.6.1 71a3e8b29e06 4 weeks ago 145 MB
protokube 1.7.0 f1aefdb5580c 7 weeks ago 363.4 MB
gcr.io/google_containers/kube-proxy v1.7.2 13a7af96c7e8 7 weeks ago 114.7 MB
gcr.io/google_containers/k8s-dns-sidecar-amd64 1.14.4 38bac66034a6 11 weeks ago 41.81 MB
gcr.io/google_containers/k8s-dns-kube-dns-amd64 1.14.4 a8e00546bcf3 11 weeks ago 49.38 MB
gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64 1.14.4 f7f45b9cb733 11 weeks ago 41.41 MB
gcr.io/google_containers/cluster-proportional-autoscaler-amd64 1.1.2-r2 7d892ca550df 3 months ago 49.64 MB
gcr.io/google_containers/pause-amd64 3.0 99e59f495ffa 16 months ago 746.9 kB
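A quick way to make this check repeatable is to filter the listing for anything outside the stock Kubernetes registries. The sketch below embeds a sample listing so it is self-contained; on a real node you would pipe the output of `sudo docker images` in instead, and the registry prefixes are just the ones visible above:

```shell
# Filter a docker image listing for custom (non-Kubernetes) images.
# The sample listing stands in for `sudo docker images` output.
cat > /tmp/images.txt <<'EOF'
quay.io/external_storage/efs-provisioner latest b4da30527798
gcr.io/google_containers/kube-proxy v1.7.2 13a7af96c7e8
protokube 1.7.0 f1aefdb5580c
EOF
# Print any image whose repository is not one of the stock prefixes
grep -v -e '^gcr.io/' -e '^quay.io/' -e '^protokube' /tmp/images.txt \
  || echo "no custom images found"
```

Run against a healthy pre-baked node this should print the custom images; against the sample listing above it prints "no custom images found".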
When a node is created in the cluster does something happen during provisioning that would cause my images to be removed?
Thank you
Docker was probably re-installed; can you look at the daemon.log file for me? Nodeup does the docker installs.
@chrislovecnm
I ran cat /var/log/daemon.log | grep nodeup | grep docker > daemon-parsed.log and then pulled this section out after a quick scan by eye:
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.188874 787 executor.go:157] Executing task "Package/docker-engine": Package: docker-engine
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.189119 787 package.go:134] Listing installed packages: dpkg-query -f ${db:Status-Abbrev}${Version}\n -W docker-engine
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.195244 787 changes.go:80] Field changed "Source" actual="<nil>" expected="http://apt.dockerproject.org/repo/pool/main/d/docker-engine/docker-engine_1.12.6-0~debian-jessie_amd64.deb"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: W0915 02:12:36.195932 787 package.go:335] cannot apply package changes for "docker-engine": Package:
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.196631 787 changes.go:80] Field changed "Definition" actual="<nil>" expected="[Unit]\nDescription=Kubernetes Protokube Service\nDocumentation=https://github.com/kubernetes/kops\n\n[Service]\nExecStartPre=/bin/true\nExecStart=/usr/bin/docker run -v /:/rootfs/ -v /var/run/dbus:/var/run/dbus -v /run/systemd:/run/systemd --net=host --privileged --env KUBECONFIG=/rootfs/var/lib/kops/kubeconfig protokube:1.7.0 /usr/bin/protokube --cloud=aws --containerized=true --dns-internal-suffix=internal.scpdcluster.k8s.local --dns=gossip --master=false --v=4\nRestart=always\nRestartSec=2s\nStartLimitInterval=0\n\n[Install]\nWantedBy=multi-user.target\n"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.198871 787 executor.go:157] Executing task "service/docker-healthcheck.timer": Service: docker-healthcheck.timer
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.199140 787 changes.go:80] Field changed "Definition" actual="<nil>" expected="[Unit]\nDescription=Trigger docker-healthcheck periodically\n\n[Timer]\nOnUnitInactiveSec=10s\nUnit=docker-healthcheck.service\n\n[Install]\nWantedBy=multi-user.target"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.199982 787 files.go:50] Writing file "/lib/systemd/system/docker-healthcheck.timer"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.203944 787 executor.go:157] Executing task "service/docker-healthcheck.service": Service: docker-healthcheck.service
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.204219 787 changes.go:80] Field changed "Definition" actual="<nil>" expected="[Unit]\nDescription=Run docker-healthcheck once\n\n[Service]\nType=oneshot\nExecStart=/opt/kubernetes/helpers/docker-healthcheck\n\n[Install]\nWantedBy=multi-user.target"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.204335 787 files.go:50] Writing file "/lib/systemd/system/docker-healthcheck.service"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.209776 787 changes.go:80] Field changed "Definition" actual="<nil>" expected="[Unit]\nDescription=Kubernetes Kubelet Server\nDocumentation=https://github.com/kubernetes/kubernetes\nAfter=docker.service\n\n[Service]\nEnvironmentFile=/etc/sysconfig/kubelet\nExecStart=/usr/local/bin/kubelet \"$DAEMON_ARGS\"\nRestart=always\nRestartSec=2s\nStartLimitInterval=0\nKillMode=process\n"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.210356 787 executor.go:157] Executing task "Service/docker.service": Service: docker.service
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.210463 787 service.go:123] querying state of service "docker.service"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.302591 787 service.go:344] Restarting service "docker-healthcheck.service"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.344586 787 changes.go:80] Field changed "Definition" actual="[Unit]\nDescription=Docker Application Container Engine\nDocumentation=https://docs.docker.com\nAfter=network.target docker.socket\nRequires=docker.socket\n\n[Service]\nType=notify\n# the default is not to use systemd for cgroups because the delegate issues still\n# exists and systemd currently does not support the cgroup feature set required\n# for containers run by docker\nExecStart=/usr/bin/dockerd -H fd://\nExecReload=/bin/kill -s HUP $MAINPID\n# Having non-zero Limit*s causes performance problems due to accounting overhead\n# in the kernel. We recommend using cgroups to do container-local accounting.\nLimitNOFILE=infinity\nLimitNPROC=infinity\nLimitCORE=infinity\n# Uncomment TasksMax if your systemd version supports it.\n# Only systemd 226 and above support this version.\n#TasksMax=infinity\nTimeoutStartSec=0\n# set delegate yes so that systemd does not reset the cgroups of docker containers\nDelegate=yes\n# kill only the docker process, not all processes in the cgroup\nKillMode=process\n\n[Install]\nWantedBy=multi-user.target\n" expected="[Unit]\nDescription=Docker Application Container Engine\nDocumentation=https://docs.docker.com\nAfter=network.target docker.socket\nRequires=docker.socket\n\n[Service]\nType=notify\nEnvironmentFile=/etc/sysconfig/docker\nExecStart=/usr/bin/dockerd -H fd:// \"$DOCKER_OPTS\"\nExecReload=/bin/kill -s HUP $MAINPID\nKillMode=process\nTimeoutStartSec=0\nLimitNOFILE=1048576\nLimitNPROC=1048576\nLimitCORE=infinity\nRestart=always\nRestartSec=2s\nStartLimitInterval=0\nDelegate=yes\nExecStartPre=/opt/kubernetes/helpers/docker-prestart\n\n[Install]\nWantedBy=multi-user.target\n"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.345464 787 files.go:50] Writing file "/lib/systemd/system/docker.service"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.376692 787 service.go:344] Restarting service "docker-healthcheck.timer"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.510966 787 service.go:242] extracted depdendency from "ExecStart=/usr/bin/dockerd -H fd:// \"$DOCKER_OPTS\"": "/usr/bin/dockerd"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.511002 787 service.go:123] querying state of service "docker.service"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.519388 787 service.go:333] will restart service "docker.service" because dependency changed after service start
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.519607 787 service.go:344] Restarting service "docker.service"
Sep 15 02:12:36 ip-10-50-255-61 nodeup[787]: I0915 02:12:36.523211 787 service.go:355] Enabling service "docker-healthcheck.timer"
Sep 15 02:12:39 ip-10-50-255-61 nodeup[787]: I0915 02:12:39.880080 787 service.go:355] Enabling service "docker-healthcheck.service"
It looks like something is going on here. There is a lot more to the log even after filtering on "nodeup" and "docker"; I can post it all if you like, and I can filter on any suggested keywords as well.
Thanks
Can you post the log in a gist and link it here? I would need to recreate this. You can also rerun nodeup to see if it clears the containers again on a node.
@justinsb ideas?
@chrislovecnm
Here is a link to the full daemon.log.
The ami has user data that was useful when building it: it logs into the ecr and pulls the docker images for me. At the beginning of the log I can see it perform this action and confirm the images are present.
I connected to the node in the cluster and re-pulled the docker images manually. I then reran nodeup using the following command /var/cache/kubernetes-install/nodeup --conf=/var/cache/kubernetes-install/kube_env.yaml --v=8 2> rerun-nodeup.log and captured the output in this log:
Rerunning nodeup had no effect; my images remained on the node. So whatever removes them happens only when the node first starts up in the cluster.
@chrislovecnm Solved! --> just a difference in docker storage drivers
I investigated this further today and took a look at the docker-engine package information on a node in my cluster. It looks like docker was not reinstalled; otherwise I think the package info would reflect the change in the 'Modify' date:
$ stat /var/lib/dpkg/info/docker-engine.list
File: ‘/var/lib/dpkg/info/docker-engine.list’
Size: 5571 Blocks: 16 IO Block: 4096 regular file
Device: ca01h/51713d Inode: 1574372 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2017-09-22 13:04:43.350631000 +0000
Modify: 2017-07-28 04:00:49.878627794 +0000
Change: 2017-07-28 04:00:49.882627934 +0000
Birth: -
After digging through the daemon.log further I noticed that docker is using the overlay storage driver. When I start up an ec2 instance using the kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28 ami outside of a cluster, it defaults to the devicemapper storage driver. So when I build my custom ami starting with the kope.io/k8s-1.7-debian-jessie-amd64-hvm-ebs-2017-07-28 ami, prior to pulling my images, I first create a daemon.json config file in /etc/docker with the following contents (as described in the docker docs for storage drivers):
{
  "storage-driver": "overlay"
}
Then restart the docker service:
sudo systemctl restart docker
After pulling my images I found it was important to remove the /etc/docker/daemon.json file before creating a snapshot and registering the ami. If the file was not removed, the resulting ami could still launch a node successfully, but the kubernetes components never installed/configured properly and the node never connected to the cluster.
For anyone interested, to automate this process I launched an ec2 instance with a role that allows pulling from a private aws ecr, and the following in the user data:
#!/bin/bash
# Switch the running docker daemon to the overlay storage driver so the
# pulled image layers match what kops nodes will use.
echo '{
  "storage-driver": "overlay"
}' > /etc/docker/daemon.json
systemctl restart docker
# Remove the file again right away: the restarted daemon keeps the
# driver, and leaving daemon.json in the ami breaks node provisioning.
rm /etc/docker/daemon.json
# Log in to the private ecr and pull the images to bake into the ami
export DOCKERLOGIN=$(/usr/local/bin/aws ecr get-login --region us-east-1)
$(echo $DOCKERLOGIN)  # executes the returned docker login command
docker pull example/image1:0.1.0
docker pull example/image2:0.1.0
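One fragile spot in a script like the one above is the hand-written JSON: a typo in daemon.json can leave docker unable to restart mid-build. A defensive variant validates the file before restarting the daemon. This is only a sketch; it writes to /tmp so it is runnable anywhere, whereas on the build instance the path would be /etc/docker/daemon.json:

```shell
# Write the temporary storage-driver override and sanity-check that it
# is valid JSON before asking docker to reload it.
conf=/tmp/daemon.json   # /etc/docker/daemon.json on the real build instance
cat > "$conf" <<'EOF'
{
  "storage-driver": "overlay"
}
EOF
# Abort the build if the file is not valid JSON
python3 -c 'import json,sys; json.load(open(sys.argv[1]))' "$conf" \
  && echo "daemon.json ok" \
  || exit 1
```

With the file validated, the `systemctl restart docker` / `rm` sequence proceeds as in the script above.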
I then connected to the instance and made sure the images had finished pulling before stopping it, creating a snapshot, and registering the ami. When I use the new ami in the cluster, my images are there!
Sounds like this is resolved, feel free to re-open if not.