Argo: MountVolume.SetUp failed for volume "docker-sock" & "docker-lib"

Created on 14 Apr 2018 · 14 comments · Source: argoproj/argo

BUG REPORT

What happened:

Tried to deploy the basic hello-world.yaml example to a Kubernetes cluster on Azure AKS.

It looks like it cannot mount the Docker socket or the Docker library directory:

  Warning  FailedMount            12s (x6 over 27s)  kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
  Warning  FailedMount            12s (x6 over 27s)  kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp failed for volume "docker-lib" : hostPath type check failed: /var/lib/docker is not a directory

How to reproduce it (as minimally and precisely as possible):

argo submit hello-world.yaml
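For reference, the hello-world.yaml example shipped with Argo at the time looked roughly like this (reproduced from memory; check the Argo examples directory for the exact manifest):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-    # each submission gets a unique name
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay:latest
      command: [cowsay]
      args: ["hello world"]
```

Note the workflow itself declares no hostPath volumes; the docker-sock and docker-lib mounts are added to the pod by the Argo controller for the wait sidecar.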

Environment:

  • Argo version:
v2.1.0-beta2
  • Kubernetes version:
1.9.1 (RBAC disabled)

Other debugging information (if applicable):

  • workflow result:
$ argo get tf-workflow-5jcpn-3759387957
...
Running
...
  • pod description (from kubectl describe):
Name:           tf-workflow-5jcpn-3759387957
Namespace:      tfworkflow
Node:           aks-nodepool1-21279999-2/10.240.0.4
Start Time:     Fri, 13 Apr 2018 13:51:51 -0400
Labels:         workflows.argoproj.io/completed=false
                workflows.argoproj.io/workflow=tf-workflow-5jcpn
Annotations:    workflows.argoproj.io/node-name=tf-workflow-5jcpn[0].get-workflow-info
                workflows.argoproj.io/template={"name":"get-workflow-info","inputs":{},"outputs":{"parameters":[{"name":"s3-model-url","valueFrom":{"path":"/tmp/s3-model-url"}},{"name":"s3-exported-url","valueFrom":{...
Status:         Pending
IP:
Controlled By:  Workflow/tf-workflow-5jcpn
Containers:
  main:
    Container ID:
    Image:         nervana/circleci:master
    Image ID:
    Port:          <none>
    Command:
      echo 's3://tfjob/models/myjob-07b1d/' | tr -d '[:space:]' > /tmp/s3-model-url; echo 's3://tfjob/models/myjob-07b1d/export/mnist/' | tr -d '[:space:]' > /tmp/s3-exported-url
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cpbjn (ro)
  wait:
    Container ID:
    Image:         argoproj/argoexec:v2.1.0-beta2
    Image ID:
    Port:          <none>
    Command:
      argoexec
    Args:
      wait
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      ARGO_POD_IP:      (v1:status.podIP)
      ARGO_POD_NAME:   tf-workflow-5jcpn-3759387957 (v1:metadata.name)
      ARGO_NAMESPACE:  tfworkflow (v1:metadata.namespace)
    Mounts:
      /argo/podmetadata from podmetadata (rw)
      /var/lib/docker from docker-lib (ro)
      /var/run/docker.sock from docker-sock (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-cpbjn (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  podmetadata:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  docker-lib:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker
    HostPathType:  Directory
  docker-sock:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/docker.sock
    HostPathType:  Socket
  default-token-cpbjn:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-cpbjn
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age                From                               Message
  ----     ------                 ----               ----                               -------
  Normal   Scheduled              28s                default-scheduler                  Successfully assigned tf-workflow-5jcpn-3759387957 to aks-nodepool1-21279999-2
  Normal   SuccessfulMountVolume  27s                kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp succeeded for volume "podmetadata"
  Normal   SuccessfulMountVolume  27s                kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp succeeded for volume "default-token-cpbjn"
  Warning  FailedMount            12s (x6 over 27s)  kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
  Warning  FailedMount            12s (x6 over 27s)  kubelet, aks-nodepool1-21279999-2  MountVolume.SetUp failed for volume "docker-lib" : hostPath type check failed: /var/lib/docker is not a directory
bug

Most helpful comment

So this seems to be the underlying cause:
https://github.com/kubernetes/kubernetes/issues/61801
The fix will be in 1.9.7

All 14 comments

@julienstroheker This seems to be a new issue that is specific to AKS. By any chance, could you SSH to the worker node to find the docker.sock and Docker library locations?

Hi @wanghong230

/var/run for the socket and /var/lib for the library are the correct locations on my workers.

Do I need specific options on the Kubernetes API server to run Argo?

Can you run the following commands on any one of your minions?

$ sudo stat /var/run/docker.sock
$ sudo stat /var/lib/docker
$ sudo ls /var/lib/docker
azureuser@aks-nodepool1-21279999-0:~$ sudo stat /var/run/docker.sock
  File: '/var/run/docker.sock'
  Size: 0           Blocks: 0          IO Block: 4096   socket
Device: 17h/23d Inode: 512         Links: 1
Access: (0660/srw-rw----)  Uid: (    0/    root)   Gid: (  999/  docker)
Access: 2018-04-16 13:09:33.171498425 +0000
Modify: 2018-04-12 18:54:24.872367314 +0000
Change: 2018-04-12 18:54:24.872367314 +0000
 Birth: -
azureuser@aks-nodepool1-21279999-0:~$ sudo stat /var/lib/docker
  File: '/var/lib/docker'
  Size: 4096        Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d  Inode: 256275      Links: 11
Access: (0711/drwx--x--x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-04-12 20:46:11.119900990 +0000
Modify: 2018-04-12 18:54:21.716380333 +0000
Change: 2018-04-12 18:54:24.900367207 +0000
 Birth: -
azureuser@aks-nodepool1-21279999-0:~$ sudo ls /var/lib/docker/
containers  image  network  overlay2  plugins  swarm  tmp  trust  volumes

Very strange. Based on that stat output, I can't understand how Kubernetes could be complaining:

hostPath type check failed: /var/run/docker.sock is not a socket file
hostPath type check failed: /var/lib/docker is not a directory

The stat output clearly shows those files are of the expected types.

@julienstroheker, we suspect that Azure might have some special security safeguards we are not aware of that prevent the mounting of hostPaths. Unfortunately, I don't have an Azure cluster to experiment with.

If possible, could you run a pod (not via a workflow) which tries to mount something like /var/tmp, and let us know if it works?

  volumes:
  - name: test-volume
    hostPath:
      path: /var/tmp
      type: Directory

If that succeeds, could you run the pod again, this time mounting /var/lib/docker (again not via a workflow)? The goal is to determine what AKS permits with regard to mounting hostPaths.

NOTE: it is important to specify the type: of the hostPath, because the errors seem to be stemming from here:
https://github.com/kubernetes/kubernetes/blob/9dd81555b07713002cc895b159740143e3d48f67/pkg/volume/host_path/host_path.go#L428
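A complete standalone pod manifest for this experiment could look like the following (the pod name and image are illustrative; any image that stays running would do):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-test
spec:
  containers:
  - name: main
    image: nginx:1.7.9
    volumeMounts:
    - mountPath: /var/tmp
      name: test-volume
  volumes:
  - name: test-volume
    hostPath:
      path: /var/tmp
      type: Directory    # the type check under suspicion; omit to skip it
```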

@jessesuen thanks for the answer, I'll try and let you know.

@jessesuen After running some tests, this is what I have:

Deploying

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /var/tmp
          name: test-volume
      volumes:
        - name: test-volume
          hostPath:
            path: /var/tmp
            type: Directory

Describe output:

...
Events:
  Type     Reason                 Age                From                               Message
  ----     ------                 ----               ----                               -------
  Normal   Scheduled              34s                default-scheduler                  Successfully assigned nginx-deployment-5cd56d8c94-bwpcn to aks-nodepool1-21279999-0
  Normal   SuccessfulMountVolume  34s                kubelet, aks-nodepool1-21279999-0  MountVolume.SetUp succeeded for volume "default-token-9jl8t"
  Warning  FailedMount            18s (x6 over 34s)  kubelet, aks-nodepool1-21279999-0  MountVolume.SetUp failed for volume "test-volume" : hostPath type check failed: /var/tmp is not a directory

Now, removing the type: field and deploying:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /var/tmp
          name: test-volume
      volumes:
        - name: test-volume
          hostPath:
            path: /var/tmp

I now get:

...
Events:
  Type    Reason                 Age   From                               Message
  ----    ------                 ----  ----                               -------
  Normal  Scheduled              13s   default-scheduler                  Successfully assigned nginx-deployment-565c95c98c-h882k to aks-nodepool1-21279999-1
  Normal  SuccessfulMountVolume  13s   kubelet, aks-nodepool1-21279999-1  MountVolume.SetUp succeeded for volume "test-volume"
  Normal  SuccessfulMountVolume  13s   kubelet, aks-nodepool1-21279999-1  MountVolume.SetUp succeeded for volume "default-token-9jl8t"
  Normal  Pulling                11s   kubelet, aks-nodepool1-21279999-1  pulling image "nginx:1.7.9"
  Normal  Pulled                 5s    kubelet, aks-nodepool1-21279999-1  Successfully pulled image "nginx:1.7.9"
  Normal  Created                5s    kubelet, aks-nodepool1-21279999-1  Created container
  Normal  Started                5s    kubelet, aks-nodepool1-21279999-1  Started container

This is odd... I am curious whether it is related to AKS or not. Have you already seen something similar?

Here are more tests:

  • AKS v1.9.6: same behavior (failed mount)
  • AKS v1.8.10: works!
  • Minikube v1.10: works!

Thanks for the pointers, I'm at a loss for an explanation. I posed the question in #sig-azure. I'll update on what I find.

So this seems to be the underlying cause:
https://github.com/kubernetes/kubernetes/issues/61801
The fix will be in 1.9.7

Also, this is Windows disk (i.e. Azure) specific, according to the PR fix:
https://github.com/kubernetes/kubernetes/pull/62250
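Since the fix landed in Kubernetes 1.9.7, a quick way to tell whether a given kubelet version predates it is a plain version comparison. A minimal sketch (the `needs_fix` helper is illustrative; in a real cluster you would first fetch each node's kubelet version, e.g. via `kubectl get nodes -o wide`):

```shell
# Returns success (0) when the given kubelet version predates the v1.9.7 fix
# for kubernetes/kubernetes#61801.
needs_fix() {
  ver="${1#v}"          # strip leading "v" from e.g. "v1.9.6"
  fixed="1.9.7"
  # sort -V orders version strings numerically; if ver sorts first and
  # differs from the fixed version, it is older than the fix.
  lowest=$(printf '%s\n%s\n' "$ver" "$fixed" | sort -V | head -n1)
  [ "$lowest" = "$ver" ] && [ "$ver" != "$fixed" ]
}

if needs_fix "v1.9.6"; then echo "affected"; else echo "fixed"; fi
```

This only checks the version, of course; the later cri-o report below suggests the same symptom can have other causes.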

Thanks @jessesuen, good to know!

Hi - I'm still seeing this in a cluster not running Docker (cri-o://1.18.1). Is there a workaround?
