BUG REPORT

What happened:
I tried to deploy the basic hello-world.yaml example in a Kubernetes cluster on Azure AKS. It looks like the pod cannot mount the Docker socket and the Docker library directory:

Warning  FailedMount  12s (x6 over 27s)  kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
Warning  FailedMount  12s (x6 over 27s)  kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp failed for volume "docker-lib" : hostPath type check failed: /var/lib/docker is not a directory
How to reproduce it (as minimally and precisely as possible):
argo submit hello-world.yaml
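For reference, the upstream hello-world.yaml is essentially the following (a sketch reproduced from the Argo examples; the exact file may differ between versions):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]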
Environment:
Argo version: v2.1.0-beta2
Kubernetes version: 1.9.1 (RBAC disabled)
Other debugging information (if applicable):
$ argo get tf-workflow-5jcpn-3759387957
...
Running
...
Name:           tf-workflow-5jcpn-3759387957
Namespace:      tfworkflow
Node:           aks-nodepool1-21279999-2/10.240.0.4
Start Time:     Fri, 13 Apr 2018 13:51:51 -0400
Labels:         workflows.argoproj.io/completed=false
                workflows.argoproj.io/workflow=tf-workflow-5jcpn
Annotations:    workflows.argoproj.io/node-name=tf-workflow-5jcpn[0].get-workflow-info
                workflows.argoproj.io/template={"name":"get-workflow-info","inputs":{},"outputs":{"parameters":[{"name":"s3-model-url","valueFrom":{"path":"/tmp/s3-model-url"}},{"name":"s3-exported-url","valueFrom":{...
Status:         Pending
IP:
Controlled By:  Workflow/tf-workflow-5jcpn
Containers:
  main:
    Container ID:
    Image:         nervana/circleci:master
    Image ID:
    Port:          <none>
    Command:
      echo 's3://tfjob/models/myjob-07b1d/' | tr -d '[:space:]' > /tmp/s3-model-url; echo 's3://tfjob/models/myjob-07b1d/export/mnist/' | tr -d '[:space:]' > /tmp/s3-exported-url
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cpbjn (ro)
  wait:
    Container ID:
    Image:         argoproj/argoexec:v2.1.0-beta2
    Image ID:
    Port:          <none>
    Command:
      argoexec
    Args:
      wait
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      ARGO_POD_IP:     (v1:status.podIP)
      ARGO_POD_NAME:   tf-workflow-5jcpn-3759387957 (v1:metadata.name)
      ARGO_NAMESPACE:  tfworkflow (v1:metadata.namespace)
    Mounts:
      /argo/podmetadata from podmetadata (rw)
      /var/lib/docker from docker-lib (ro)
      /var/run/docker.sock from docker-sock (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cpbjn (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  podmetadata:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  docker-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker
    HostPathType:  Directory
  docker-sock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  Socket
  default-token-cpbjn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cpbjn
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age                From                               Message
  ----     ------                 ----               ----                               -------
  Normal   Scheduled              28s                default-scheduler                  Successfully assigned tf-workflow-5jcpn-3759387957 to aks-nodepool1-21279999-2
  Normal   SuccessfulMountVolume  27s                kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp succeeded for volume "podmetadata"
  Normal   SuccessfulMountVolume  27s                kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp succeeded for volume "default-token-cpbjn"
  Warning  FailedMount            12s (x6 over 27s)  kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
  Warning  FailedMount            12s (x6 over 27s)  kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp failed for volume "docker-lib" : hostPath type check failed: /var/lib/docker is not a directory
@julienstroheker This seems to be a new issue that is specific to AKS. By any chance, could you ssh into the worker node to check the locations of docker.sock and the Docker library?
Hi @wanghong230
/var/run/docker.sock and /var/lib/docker are at the correct locations on my workers.
Do I need specific options in the Kubernetes API server to run Argo?
Can you run the following commands on any one of your minions?
$ sudo stat /var/run/docker.sock
$ sudo stat /var/lib/docker
$ sudo ls /var/lib/docker
azureuser@aks-nodepool1-21279999-0:~$ sudo stat /var/run/docker.sock
  File: '/var/run/docker.sock'
  Size: 0          Blocks: 0          IO Block: 4096   socket
Device: 17h/23d    Inode: 512        Links: 1
Access: (0660/srw-rw----)  Uid: (    0/    root)   Gid: (  999/  docker)
Access: 2018-04-16 13:09:33.171498425 +0000
Modify: 2018-04-12 18:54:24.872367314 +0000
Change: 2018-04-12 18:54:24.872367314 +0000
 Birth: -
azureuser@aks-nodepool1-21279999-0:~$ sudo stat /var/lib/docker
  File: '/var/lib/docker'
  Size: 4096       Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d Inode: 256275     Links: 11
Access: (0711/drwx--x--x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-04-12 20:46:11.119900990 +0000
Modify: 2018-04-12 18:54:21.716380333 +0000
Change: 2018-04-12 18:54:24.900367207 +0000
 Birth: -
azureuser@aks-nodepool1-21279999-0:~$ sudo ls /var/lib/docker/
containers  image  network  overlay2  plugins  swarm  tmp  trust  volumes
Very strange. Based on that stat output, I can't understand how Kubernetes could be complaining:
hostPath type check failed: /var/run/docker.sock is not a socket file
hostPath type check failed: /var/lib/docker is not a directory
The stat output clearly shows that those paths have the expected file types.
@julienstroheker, we suspect that Azure might have some special security safeguards, which we are not aware of, preventing the mounting of these hostPaths. Unfortunately, I don't have an Azure cluster to experiment with.
If possible, could you run a pod (not via a workflow) that tries to mount something like /var/tmp, and let us know if it works?
volumes:
- name: test-volume
  hostPath:
    path: /var/tmp
    type: Directory
If that succeeds, could you run the pod again, this time mounting /var/lib/docker (again not via a workflow)? The goal is to determine what AKS permits with regard to mounting hostPaths.
NOTE: it is important to specify the type: of the hostPath, because the errors seem to stem from here:
https://github.com/kubernetes/kubernetes/blob/9dd81555b07713002cc895b159740143e3d48f67/pkg/volume/host_path/host_path.go#L428
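For convenience, a complete standalone Pod that exercises the same check might look like this (pod name and image are illustrative, not from the thread):

apiVersion: v1
kind: Pod
metadata:
  name: hostpath-type-test  # illustrative name
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: test-volume
      mountPath: /var/tmp
  volumes:
  - name: test-volume
    hostPath:
      path: /var/tmp
      type: Directory  # the type check that appears to be failing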
@jessesuen thanks for the answer, I'll try and let you know.
@jessesuen After running some tests, this is what I have:
Deploying:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /var/tmp
          name: test-volume
      volumes:
      - name: test-volume
        hostPath:
          path: /var/tmp
          type: Directory
Describe:
...
Events:
  Type     Reason                 Age                From                               Message
  ----     ------                 ----               ----                               -------
  Normal   Scheduled              34s                default-scheduler                  Successfully assigned nginx-deployment-5cd56d8c94-bwpcn to aks-nodepool1-21279999-0
  Normal   SuccessfulMountVolume  34s                kubelet, aks-nodepool1-21279999-0  MountVolume.SetUp succeeded for volume "default-token-9jl8t"
  Warning  FailedMount            18s (x6 over 34s)  kubelet, aks-nodepool1-21279999-0  MountVolume.SetUp failed for volume "test-volume" : hostPath type check failed: /var/tmp is not a directory
Now, when I remove the type: and deploy:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /var/tmp
          name: test-volume
      volumes:
      - name: test-volume
        hostPath:
          path: /var/tmp
I now have:
...
Events:
  Type    Reason                 Age  From                               Message
  ----    ------                 ---- ----                               -------
  Normal  Scheduled              13s  default-scheduler                  Successfully assigned nginx-deployment-565c95c98c-h882k to aks-nodepool1-21279999-1
  Normal  SuccessfulMountVolume  13s  kubelet, aks-nodepool1-21279999-1  MountVolume.SetUp succeeded for volume "test-volume"
  Normal  SuccessfulMountVolume  13s  kubelet, aks-nodepool1-21279999-1  MountVolume.SetUp succeeded for volume "default-token-9jl8t"
  Normal  Pulling                11s  kubelet, aks-nodepool1-21279999-1  pulling image "nginx:1.7.9"
  Normal  Pulled                 5s   kubelet, aks-nodepool1-21279999-1  Successfully pulled image "nginx:1.7.9"
  Normal  Created                5s   kubelet, aks-nodepool1-21279999-1  Created container
  Normal  Started                5s   kubelet, aks-nodepool1-21279999-1  Started container
This is odd... I am curious to understand whether it is related to AKS or not. Have you already seen something similar?
Here are more tests:
Thanks for the pointers; I'm at a loss for an explanation. I posed the question in #sig-azure and will update with what I find.
So this seems to be the underlying cause:
https://github.com/kubernetes/kubernetes/issues/61801
The fix will be in 1.9.7
Also, this is Windows disk (i.e. Azure) specific, according to the PR fix:
https://github.com/kubernetes/kubernetes/pull/62250
Thanks @jessesuen, good to know!
Hi, I'm still seeing this in a cluster that is not running Docker (cri-o://1.18.1). Is there a workaround?
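One possible direction (a sketch, assuming an Argo v2 release that supports alternative executors and the default ConfigMap name; verify against your version's docs): the pns and k8sapi executors avoid mounting /var/run/docker.sock entirely, so they can run on Docker-less runtimes such as cri-o or containerd:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap  # assumed default name; match your install
  namespace: argo                      # assumed install namespace
data:
  # 'pns' (process namespace sharing) and 'k8sapi' do not require
  # mounting the host Docker socket or /var/lib/docker
  containerRuntimeExecutor: pns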