Cadvisor: [0.29.1 and above] cAdvisor cannot reach containerd.sock

Created on 22 Mar 2018 · 13 Comments · Source: google/cadvisor

Hi,

I receive the following error message when starting cadvisor. However, cadvisor itself works fine, and I couldn't see any specific metrics missing (maybe I didn't observe for long enough).

I can reproduce this only on the following branches:

release-v0.29
master (as of reporting this issue today)

grpc: addrConn.resetTransport failed to create client transport: connection error:
desc = "transport: dial unix:///var/run/containerd/containerd.sock: timeout"; Reconnecting to {unix:///var/run/containerd/containerd.sock <nil>}

Further, if I change the containerd socket path via an argument to cadvisor, cadvisor doesn't seem to pick it up either: the error still names the default path.

./cadvisor -containerd "/run/docker/containerd/docker-containerd.sock" 

2018/03/21 14:48:47 grpc: addrConn.resetTransport failed to create client transport: connection error:
desc = "transport: dial unix:///var/run/containerd/containerd.sock: timeout"; Reconnecting to {unix:///var/run/containerd/containerd.sock <nil>}

I would like to repeat that cAdvisor in itself seems to be working fine despite this error message.
For now I have tried release-v0.28 and I don't get this error message.

Please let me know if you need any further details. I also think that this is related to https://github.com/google/cadvisor/issues/1895

Thanks!

xgt001

👍2

Most helpful comment

For me it is still the same problem, and restarting the daemon or the system should not be an option.
My compose-file snippet (running in swarm):

cadvisor:
  image: google/cadvisor:v0.29.0
  privileged: true
  networks:
    - mynet
  ports:
    - "9902:8080"
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:rw
    - /sys:/sys:ro # removing does not help
    - /var/lib/docker/:/var/lib/docker:ro
  #- /cgroup:/sys/fs/cgroup:ro # does not help
    - /dev/disk/:/dev/disk:ro
  #- /dev/mapper:/dev/mapper:ro # does not help
  deploy:
    mode: global
    endpoint_mode: vip
    resources:
      limits:
        memory: 256M
      reservations:
        memory: 64M
  healthcheck:
    test: wget --quiet --spider http://localhost:8080
    retries:       4
    interval:     30s
    timeout:      25s
    start_period: 60s

docker --version:
Docker version 18.03.1-ce, build 9ee9f40

cAdvisor log entry:

I0706 13:07:35.909949       1 storagedriver.go:50] Caching stats in memory for 2m0s,
I0706 13:07:35.911174       1 manager.go:154] cAdvisor running in container: "/sys/fs/cgroup/cpu,cpuacct",
I0706 13:07:35.975468       1 fs.go:142] Filesystem UUIDs: ...,
I0706 13:07:35.975581       1 fs.go:143] Filesystem partitions: ...,
I0706 13:07:35.983029       1 manager.go:227] Machine: ...,
I0706 13:07:35.984454       1 manager.go:233] Version: {KernelVersion:3.10.0-693.21.1.el7.x86_64 ContainerOsVersion:Alpine Linux v3.4 DockerVersion:18.03.1-ce DockerAPIVersion:1.37 CadvisorVersion:v0.29.0 CadvisorRevision:aaaa65d},
E0706 13:07:36.020107       1 factory.go:340] devicemapper filesystem stats will not be reported: usage of thin_ls is disabled to preserve iops,
I0706 13:07:36.020799       1 factory.go:356] Registering Docker factory,
I0706 13:07:38.021227       1 factory.go:54] Registering systemd factory,
I0706 13:07:38.022976       1 factory.go:86] Registering Raw factory,
I0706 13:07:38.024327       1 manager.go:1205] Started watching for new ooms in manager,
W0706 13:07:38.024367       1 manager.go:340] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory,
I0706 13:07:38.029194       1 manager.go:356] Starting recovery of all containers,
I0706 13:07:38.203094       1 manager.go:361] Recovery completed,
I0706 13:07:38.400871       1 cadvisor.go:163] Starting cAdvisor version: v0.29.0-aaaa65d on port 8080,
2018/07/06 13:07:56 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix:///var/run/containerd/containerd.sock: timeout"; Reconnecting to {unix:///var/run/containerd/containerd.sock <nil>},

Any ideas?

anthraxn8b on 6 Jul 2018

👍2

All 13 comments

@Random-Liu if this is intended, we should make the error message less crazy-looking. Maybe something like: Unable to connect to containerd. This is expected if you are not using containerd directly. Then we can separately log the error with lower verbosity afterwards.

@xgt001 We added containerd support recently (running directly against containerd, rather than docker). If you are using docker, you should expect to get this error message.

dashpole on 22 Mar 2018
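
Since the message is expected when no standalone containerd is present, one way to tell the harmless case apart is to check whether a containerd socket exists on the host at all. The following is a minimal sketch, not part of cAdvisor: the helper name `first_socket` is hypothetical, and the two candidate paths are simply the ones quoted earlier in this thread. Run it on the host, or inside the cadvisor container if you want to see what the container sees.

```shell
#!/bin/sh
# Print the first argument that is an existing Unix socket.
# Returns 1 if none of the arguments is a socket.
# (first_socket is a hypothetical helper name, not a cAdvisor tool.)
first_socket() {
    for candidate in "$@"; do
        if [ -S "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Both candidate paths are taken from this thread; adjust them for your host.
if sock=$(first_socket /var/run/containerd/containerd.sock \
                       /run/docker/containerd/docker-containerd.sock); then
    echo "containerd socket found at $sock"
else
    echo "no containerd socket at the checked paths"
fi
```

If nothing is found at the default path, the gRPC timeout above is consistent with the explanation that cAdvisor is probing for a containerd that is not there.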

Hi @dashpole I started having the same issue, but it was working fine yesterday.

I just removed all the cadvisor containers from my cluster and redeployed again, and then I started getting that error.

Could you explain why cadvisor would stop working? I still have the same Docker version as when it was working before. Docker version - 17.06.2-ee-6, build e75fdb8

cadvisor compose:

  cadvisor:
    image: google/cadvisor:v0.29.0
    command:
      - '--port=8484'
    ports:
      - target: 8484
        published: 8484
    networks:
      - net
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock,readonly
      - /var/run:/var/run:rw
      - /:/rootfs:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
        delay: 10s
        max_attempts: 20
      resources:
        limits:
          # cpus: '0.50'
          memory: 64M
        reservations:
          # cpus: '0.50'
          memory: 32M

mayconbeserra on 28 Mar 2018

Actually the log is pretty normal...

I0329 23:48:58.375924   29858 factory.go:356] Registering Docker factory
I0329 23:48:58.375967   29858 manager.go:302] Registration of the rkt container factory failed: unable to communicate with Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: connect: connection refused
I0329 23:48:58.376129   29858 manager.go:313] Registration of the containerd container factory failed: failed to fetch containerd client version: grpc: the connection is unavailable: unavailable
I0329 23:48:58.376327   29858 manager.go:318] Registration of the crio container factory failed: Get http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info: dial unix /var/run/crio/crio.sock: connect: no such file or directory

The same happens for rkt and cri-o, and these messages are already logged at verbosity level 5: https://github.com/google/cadvisor/blob/master/manager/manager.go#L313

Random-Liu on 30 Mar 2018
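
To pull only the failed factory registrations out of a longer cAdvisor log, a small filter can help. This is a rough sketch that assumes the message wording shown in the log lines above ("Registration of the ... container factory failed: ..."); the function name `factory_failures` is made up for illustration, and the pattern will need adjusting if the wording differs in your version.

```shell
#!/bin/sh
# Read cAdvisor log lines on stdin and print "<runtime>: <reason>" for each
# failed container factory registration. Other lines are dropped.
# (factory_failures is a hypothetical helper name.)
factory_failures() {
    sed -n 's/.*Registration of the \([^ ]*\) container factory failed: \(.*\)$/\1: \2/p'
}

# Example using a line copied from the log above.
printf '%s\n' 'I0329 23:48:58.376129   29858 manager.go:313] Registration of the containerd container factory failed: failed to fetch containerd client version: grpc: the connection is unavailable: unavailable' \
    | factory_failures
```

A failed registration for a runtime you are not using is the benign case described above. It only matters when the runtime you actually rely on shows up in this list.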

Hi,

Same as @xgt001: cadvisor starts and logs up to this point:

I0418 12:41:07.241078       1 manager.go:233] Version: {KernelVersion:3.10.0-693.21.1.el7.x86_64 ContainerOsVersion:Alpine Linux v3.4 DockerVersion:17.12.1-ce DockerAPIVersion:1.35 CadvisorVersion:v0.29.0 CadvisorRevision:aaaa65d}
I0418 12:41:07.270718       1 factory.go:356] Registering Docker factory
I0418 12:41:09.273406       1 factory.go:54] Registering systemd factory
I0418 12:41:09.277547       1 factory.go:86] Registering Raw factory
I0418 12:41:09.280324       1 manager.go:1205] Started watching for new ooms in manager
I0418 12:41:09.286467       1 manager.go:356] Starting recovery of all containers

then if I try "curl localhost:8080", it fails:

$ curl localhost:8080
curl: (56) Recv failure: Connection reset by peer

and the following error appears in the logs:

2018/04/18 12:41:27 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix:///var/run/containerd/containerd.sock: timeout"; Reconnecting to {unix:///var/run/containerd/containerd.sock <nil>}

jouve on 18 Apr 2018

👍1
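
To tell a slow start apart from an endpoint that never comes up, it can help to poll the HTTP port for a while instead of issuing a single curl. The sketch below is only an illustration: `wait_for_http` is a hypothetical helper, the attempt count and delay are arbitrary defaults, and the URL in the usage comment is the one used in this thread.

```shell
#!/bin/sh
# Poll a URL until it answers successfully or the attempts run out.
# usage: wait_for_http URL [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as one request succeeds, 1 otherwise.
# (wait_for_http is a hypothetical helper name.)
wait_for_http() {
    url=$1
    attempts=${2:-10}
    delay=${3:-3}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # --fail turns HTTP errors (status >= 400) into a non-zero exit code.
        if curl --silent --fail --output /dev/null --max-time 5 "$url"; then
            return 0
        fi
        # Do not sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (commented out so that sourcing this file does not block):
# wait_for_http http://localhost:8080 20 3 && echo "cAdvisor is serving"
```

If the function still fails after the container has logged "Recovery completed", the problem is more likely in the port mapping or network than in startup time.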

Docker version: 17.12.1-ce
Cadvisor image: google/cadvisor:v0.29.0
Same here.

2018/05/10 15:31:19 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix:///var/run/containerd/containerd.sock: timeout"; Reconnecting to {unix:///var/run/containerd/containerd.sock <nil>}

Scukerman on 10 May 2018

Hey @xgt001, did you manage to resolve this issue?

I am also facing the same problem. When I run cadvisor under systemd, the container never comes up and gets stuck at "Starting recovery of all containers". However, the container starts up perfectly fine if I run the same command outside systemd.

/usr/bin/docker run \
--name cadvisor \
--restart=always \
--memory=256m \
--detach=true \
--pid=host \
--privileged=true \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--volume=/cgroup:/cgroup:ro \
--publish=8572:8080 \
google/cadvisor:v0.28.0

ameyrk18 on 5 Jun 2018

@ameyrk18 I am sorry, but your cadvisor version is not 0.29.x, which is where I could reproduce it on my side. Could you try the same with 0.29.1 as well, if possible?

xgt001 on 5 Jun 2018

@xgt001 Tried with 0.29.0; it's the same issue. The container never starts up when I start it with systemd, and sometimes my docker daemon goes defunct.

ameyrk18 on 5 Jun 2018

@xgt001 how are you starting? Systemd or via docker compose or standalone?

ameyrk18 on 5 Jun 2018

Tried v0.30.0, same issues. @dashpole do you want me to share some logs?

ameyrk18 on 5 Jun 2018

I can confirm the problem @anthraxn8b described. We have the same issue here.