Telegraf: Disk input gives error which disappears on restart

Created on 8 Jun 2016 · 27Comments · Source: influxdata/telegraf

NOTE this is currently attributed to an issue with systemd: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1555760

Bug report

System info:

Telegraf version 0.13.1
CoreOS version 1010.5.0
Many plugins running, including the disk input.

Steps to reproduce:

We can reproduce the issue in our environment.

Expected behavior:

Expect the Telegraf disk input to work correctly on startup.

Actual behavior:

The Telegraf disk input reports the following errors every collection cycle.

2016/06/08 18:59:20 Error in input [disk]: error getting disk usage info: too many levels of symbolic links

The input starts working correctly when Telegraf is restarted.

bug help wanted

Source

gunnaraasen

All 27 comments

can you provide your configuration?

sparrc on 14 Jun 2016

@sparrc I get it while running - https://github.com/deis/monitor/tree/master/telegraf

sstarcher on 5 Jul 2016

wegel on 19 Jul 2016

does anyone have steps to reproduce this?

sparrc on 20 Jul 2016

@sparrc my guess is it's related to telegraf running early in the startup process of CoreOS. As I have only reproduced it on newly booted boxes.

sstarcher on 20 Jul 2016

The things I know

I have reproduced it while running telegraf on boot inside a docker container.
I have also reproduced it running telegraf inside of a kubernetes pod as a Daemon set.
Once it happens it never goes away and nothing else is logged
Jumping inside of the container all mounted directories look fine

sstarcher on 20 Jul 2016

My configuration

Setting Agent Hostname to: ip-10-0-20-45.ec2.internal
Setting PROMETHEUS_URLS to: "https://192.168.3.1:443/api/v1/proxy/nodes/ip-10-0-20-45.ec2.internal/metrics", "https://192.168.3.1:443/metrics"
Building config.toml!
Finished building toml...
###########################################
###########################################
# Set Tag Configuration
[tags]
# Set Agent Configuration
[agent]
  interval = "10s"
  round_interval = true
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = false
  quiet = true
  flush_buffer_when_full = true
  hostname = "ip-10-0-20-45.ec2.internal"
# Set output configuration
[[outputs.influxdb]]
  urls = ["http://influxdb.service.consul:8086"]
  database = "telegraf"
  precision = "ns"
  timeout = "5s"




# Set Input Configuration
[[inputs.netstat]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.mem]]
[[inputs.cpu]]
  percpu = true
  totalcpu = true
[[inputs.disk]]

[[inputs.diskio]]


[[inputs.net]]

# Set Service Input Configuration
###########################################
###########################################
2016/07/01 02:46:14 Starting Telegraf (version 1.0.0-beta1-1-g2211231)
2016/07/01 02:46:14 Loaded outputs: influxdb
2016/07/01 02:46:14 Loaded inputs: disk diskio net netstat swap system mem cpu
2016/07/01 02:46:14 Tags enabled: host=ip-10-0-20-45.ec2.internal
2016/07/01 02:46:14 Agent Config: Interval:10s, Debug:false, Quiet:true, Hostname:"ip-10-0-20-45.ec2.internal", Flush Interval:10s
2016/07/01 02:46:20 ERROR in input [disk]: error getting disk usage info: too many levels of symbolic links
2016/07/01 02:46:30 ERROR in input [disk]: error getting disk usage info: too many levels of symbolic links

sstarcher on 20 Jul 2016

@sstarcher the entire telegraf process hangs when this happens?

sparrc on 20 Jul 2016

@sparrc I still get the other inputs shipped so I doubt it's hanging. I could turn debug on and quiet off to see if anything else is happening.

sstarcher on 20 Jul 2016

Disk usage and iNodes are not collected and shipped, but cpu/memory/network is being shipped.

sstarcher on 20 Jul 2016

The first time telegraf is starting I see the following in my journal logs

Jul 01 02:46:20 ip-10-0-20-45.ec2.internal systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 3782 (telegraf) Jul 01 02:46:20 ip-10-0-20-45.ec2.internal systemd[1]: Mounting Arbitrary Executable File Formats File System... Jul 01 02:46:20 ip-10-0-20-45.ec2.internal systemd[1]: Mounted Arbitrary Executable File Formats File System.```

sstarcher on 20 Jul 2016

Just noticed this in my papertrail logs and google brought me here:

Aug 03 15:15:41 localhost fluentd:  2016-08-03T22:15:40Z    fluentd {"log":"2016/08/03 22:15:40 ERROR in input [disk]: error getting disk usage info: too many levels of symbolic links\n","stream":"stderr","docker":{"container_id":"427f86914f90644f9d56770a379cfe5b48f9c7c30598092dade1b40bd8f3830a"},"kubernetes":{"namespace_name":"deis","pod_id":"6164e10f-530c-11e6-a5e2-068755380eff","pod_name":"deis-monitor-telegraf-x5e7f","container_name":"deis-monitor-telegraf","labels":{"app":"deis-monitor-telegraf"},"host":"ip-172-20-0-131.us-west-1.compute.internal"}}

2016/08/03 22:15:40 ERROR in input [disk]: error getting disk usage info: too many levels of symbolic links

Repeats every 10s

olalonde on 4 Aug 2016

I'm seeing the same issue with all telegraf pods created by deis workflow 2.3.0 on kubernetes 1.3.4, 1.0.0-beta2-18-g755b2ec and CoreOS stable as the host os.

I think @sstarcher is on the right track. If I attach to a telegraf container and traverse the rootfs with find I get the following:

find / -type s
find: '/rootfs/proc/sys/fs/binfmt_misc': Too many levels of symbolic links

The error only occurs for /rootfs/proc/sys/fs/binfmt_misc while /proc/sys/fs/binfmt_misc can be listed just fine.

The output of mount shows:

systemd-1 on /rootfs/proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=31,pgrp=0,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11343)

felixbuenemann on 12 Aug 2016

👍1

same as @felixbuenemann

felipejfc on 12 Aug 2016

I am pretty sure that all of the affected _host_ systems (not the containers) are running systemd.

I found this related ubuntu issue: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1555760

Both coreos stable (what I am running) and ubuntu xenial (listed in the report) are using systemd 229.

felixbuenemann on 12 Aug 2016

OK, I think the issue at hand is that the /proc/sys/fs/binfmt_misc is mounted on the host at first access by the automount proxy. But the telegraf container ist started before that with a read-only mount of the host /proc to /rootfs/proc.

If I do ls /proc/sys/fs/binfmt_misc on the coreos host and then kill the telegraf pod so it restarts, the error goes away in the new pod.

felixbuenemann on 12 Aug 2016

I have been able to work around the issue on CoreOS by requiring the docker.service to want the proc-sys-fs-binfmt_misc.mount service, which causes /proc/sys/fs/binfmt_misc on the host to be mounted before docker starts.

This is the cloud-config userdata to define the drop-in:

coreos:
  units:
    - name: docker.service
      drop-ins:
        - name: 60-mount-binfmt-misc.conf
          content: |
            [Unit]
            Wants=proc-sys-fs-binfmt_misc.mount
            After=proc-sys-fs-binfmt_misc.mount

The same could be done on other systems by editing or extending the docker systemd service.

felixbuenemann on 13 Aug 2016

CentOS 7.2, docker 1.12.3, telegraf 1.1.1 (in container)
Same issue.

docker-compose.yml

version: '2'
services:

  telegraf:
    image: telegraf
    environment:
      HOST_PROC: /rootfs/proc
      HOST_SYS: /rootfs/sys
      HOST_ETC: /rootfs/etc
    hostname: host
    volumes:
     - ./telegraf.conf:/etc/telegraf/telegraf.conf:ro
     - /var/run/docker.sock:/var/run/docker.sock:ro
     - /sys:/rootfs/sys:ro
     - /proc:/rootfs/proc:ro
     - /etc:/rootfs/etc:ro

telegraf.conf:

...
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
...

dvdred on 1 Dec 2016

Having the same issue #1544

m4ce on 8 Jan 2017

I personally solved this issue by creating a shell script that wraps telegraf around.

# FIXME: for some reasons, this doesn't happen to work in the entrypoint :(
if [ -n "$HOST_MOUNT_PREFIX" ]; then
  while (mount | grep "$HOST_MOUNT_PREFIX/proc/sys/fs/binfmt_misc" &>/dev/null); do
    # this solves #TELEGRAF-1352 (too many levels of symbolic links)
    umount $HOST_MOUNT_PREFIX/proc/sys/fs/binfmt_misc
  done
fi

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS

I've changed the Dockerfile's CMD statement to use the script and the issue is gone for good.

m4ce on 11 Jan 2017

has anyone been able to confirm that this is the same issue as https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1555760?

sparrc on 11 Jan 2017

Host OS Information
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial

Docker OS Information
Docker Base Image : alpine 3.4

Description
I have experienced the same phenomenon. To explain my situation briefly, the Ubuntu host did not have a problem, and there was a picture-like issue inside the docker container. Telegraf left the following logs.

2017-02-06T15: 32: 40Z E! ERROR in input [inputs.disk]: error getting disk usage info: too many levels of symbolic links

disk issue

Dark0096 on 6 Feb 2017

My input configuration like this

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "overlay"]

[[inputs.diskio]]

There is something strange. Restart the container and the disk collection will be normal. The contents of the container are as follows.

disk issue2

Dark0096 on 6 Feb 2017

@gunnaraasen Can you tell me how you resolved the issue?

Dark0096 on 7 Feb 2017

I found the cause. My case is that the host binfmt_misc is basically an automount so it will mount at the point of use.

sudo systemctl status proc-sys-fs-binfmt_misc.automount

At first, you can see the following logs.

 proc-sys-fs-binfmt_misc.automount - Arbitrary Executable File Formats File System Automount Point
   Loaded: loaded (/lib/systemd/system/proc-sys-fs-binfmt_misc.automount; static; vendor preset: enabled)
   Active: active (waiting) since Mon 2017-02-06 19:24:19 KST; 22h ago
    Where: /proc/sys/fs/binfmt_misc
     Docs: https://www.kernel.org/doc/Documentation/binfmt_misc.txt
           http://www.freedesktop.org/wiki/Software/systemd/APIFileSystems

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

After running telegraf, you can see the following logs.

proc-sys-fs-binfmt_misc.automount - Arbitrary Executable File Formats File System Automount Point
   Loaded: loaded (/lib/systemd/system/proc-sys-fs-binfmt_misc.automount; static; vendor preset: enabled)
   Active: active (running) since Mon 2017-02-06 19:24:20 KST; 22h ago
    Where: /proc/sys/fs/binfmt_misc
     Docs: https://www.kernel.org/doc/Documentation/binfmt_misc.txt
           http://www.freedesktop.org/wiki/Software/systemd/APIFileSystems

Feb 07 00:58:50 ip-xxx systemd[1]: proc-sys-fs-binfmt_misc.automount: Got automount request for /proc/sys/fs/binfmt_misc, triggered by 28579 (telegraf)
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

I've never used binfmt_misc on my host before, which means it will not mount. However, since bingmt_misc is mounted to access mount information from telegraf first, since then volume mapping information is included as well, so telegraf can collect information based on disk information without problems.

Dark0096 on 7 Feb 2017

👍1

same issue, I made a quick fix with umount /proc/sys/fs/binfmt_misc just after telegraf start