Telegraf: can not find docker metrics in influxdb, anyone can help?

Created on 4 Feb 2016 · 34Comments · Source: influxdata/telegraf

I start telegraf with the following config

$ more /etc/telegraf/telegraf.d/docker.conf
[[inputs.docker]]
  # Docker Endpoint
  #   To use TCP, set endpoint = "tcp://[ip]:[port]"
  #   To use environment variables (ie, docker-machine), set endpoint = "ENV"
  endpoint = "unix:///var/run/docker.sock"
  # Only collect metrics for these containers, collect all if empty
  container_names = []

and start telegraf by following command:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

but can not find any measurements about docker in influxdb:

> show measurements;
name: measurements
------------------
name
cpu
disk
diskio
mem
swap
system

actually, I can see docker datas collected by telegraf

$/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d  -input-filter docker -test
* Plugin: docker, Collection 1
......
> docker_cpu,com.docker.compose.config-hash=2db93f17fb0fdbb2b3be408209d18ac7eb9f44d787af2df58b6f6601771763cf,com.docker.compose.container-number=1,com.docker.compose.oneoff=False,com.docker.compose.project=grafana,com.docker.compose.service=grafana,com.docker.compose.version=1.5.1,cont_id=528cfa640ba2863df3febd0cd28b173527599b8c2d81a26c6965fc3b13b0ea2d,cont_image=grafana/grafana,cont_name=grafana_grafana_1,cpu=cpu1 usage_total=8040078470i 1454561313911620608
......

bug

Source

asdfsx

Most helpful comment

@sporokh I understand you're also hitting the issue, right? I'd like to have a PR ready by the end of the week...although cannot really promise, little bit short on time this week, but I'll try.

tripledes on 15 Mar 2016

👍2

All 34 comments

I finally fix this problem~~~~

asdfsx on 4 Feb 2016

@asdfsx what was the issue/solution?

sparrc on 4 Feb 2016

@sparrc nothing but restart telegraf several times, then it appeared in influxdb......but I think there is still some problems remain.
Right now I'm trying to use grafana to display the metrics, but I can only see one point on the graph, not a line.
caf93ae9-9195-460e-897e-5e75d7af2983
I'm confused by this.....

and when I execute sql:

SELECT mean(usage_total) FROM docker_cpu WHERE time > now() - 1h AND host = 'mesos36' GROUP BY time(15m)

I only get one record
50e697e2-34e1-44fa-870a-4a37c89d411e

asdfsx on 4 Feb 2016

It's so strange that, when I start telegraf by exec the command:

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -debug

I can get docker data in influxdb.
Then I create a graph in grafana, by using sql like below

SELECT mean("usage_total") FROM "docker_cpu" WHERE "host" = 'mesos36' AND $timeFilter GROUP BY time($interval) fill(null)

I get a graph like below
b73ccc6d-d1b7-415a-932b-83f6e883a399
It seems docker_cpu.usage_total is continue growing~
Seems not right.

And when I start telegraf by using systemctl start telegraf, it seems nothing about docker send to the influxdb~~~

SELECT mean(usage_total) FROM docker_cpu WHERE host = 'mesos36' AND time > now() - 1h GROUP BY time(5m)

docker_cpu
time    mean
2016-02-04T07:25:00Z    
2016-02-04T07:30:00Z    
2016-02-04T07:35:00Z    
2016-02-04T07:40:00Z    9535640409061.729
2016-02-04T07:45:00Z    
2016-02-04T07:50:00Z    
2016-02-04T07:55:00Z    9541001545054.232
2016-02-04T08:00:00Z    9542674282644.1
2016-02-04T08:05:00Z    9544248412986.191
2016-02-04T08:10:00Z    
2016-02-04T08:15:00Z    
2016-02-04T08:20:00Z    
2016-02-04T08:25:00Z

asdfsx on 4 Feb 2016

might be a permissions issue, try

sudo -u telegraf /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -debug

The upward slope is normal. Docker's CPU "usage" metric is actually just a counter of CPU ticks used.

sparrc on 4 Feb 2016

I think you are right!
I just notice that the telegraf is running under telegraf account.
So I modify the /etc/systemd/system/telegraf.service

[Service]
EnvironmentFile=-/etc/default/telegraf
#User=telegraf
User=root
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d ${TELEGRAF_OPTS}
Restart=on-failure
KillMode=process

You can see the telegraf was started by user telegraf before
And now I can see the new data!

BTW. You said that Docker's CPU "usage" metric is actually just a counter of CPU ticks used.
Does that mean I don't need to use mean function?
Just query data like this:

SELECT "usage_total" FROM "docker_cpu" WHERE "host" = 'mesos36'

or I need some function else, like count,sum?

asdfsx on 5 Feb 2016

yes, that query would be fine

sparrc on 5 Feb 2016

sudo su telegraf -c '/usr/bin/telegraf -config telegraf.conf -test -filter docker'

works fine. However, the service fails to send the docker metrics and the log fills with multiple instances of

Error getting docker stats: io: read/write on closed pipe

The default permission on the unix socket is 660 (UID:root, GID:docker), and I've added user telegraf to the docker group as well. @sparrc Any idea what's going wrong?

acherunilam on 19 Feb 2016

@sparrc I'm seeing the same issue with v0.10.3-1:

Feb 23 09:42:21 dev112-12 docker[879]: 2016/02/23 08:42:21 Error getting docker stats: io: read/write on closed pipe
Feb 23 09:42:25 dev112-12 docker[879]: 2016/02/23 08:42:25 Error getting docker stats: read unix @->/var/run/docker.sock: i/o timeout
Feb 23 09:42:25 dev112-12 docker[879]: 2016/02/23 08:42:25 Error getting docker stats: read unix @->/var/run/docker.sock: i/o timeout

However, % docker ps; works just fine.

zstyblik on 23 Feb 2016

I think this is most likely related to Docker Version as I'm not seeing it at hosts with Docker version 1.8.3, build f4bf5c7, but I'm seeing it at Docker version 1.9.1, build 4419fdb-dirty.

zstyblik on 23 Feb 2016

I've written small app which is scaled down plugin from Telegraf and I'm getting the same error/seeing the same issue even at Host with Docker v1.8.3. However, I can see "requests" being made in logs:

Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.054869742+01:00" level=info msg="GET /containers/json"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.056860514+01:00" level=info msg="GET /containers/6b864d4d17e370abeff82dc0bb6553905f161fc2ec3b8b2e5998ee9bd637f166/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.057260630+01:00" level=info msg="GET /containers/616487a45616594a2ca671bd0a6f5691cd71fc2c7eee7dfd85cd6f4d6949e0f1/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.057576478+01:00" level=info msg="GET /containers/6242042c2f252ab5225f0173090cf37dedda8c18cf2de5f28ed52ce57c22d69c/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.058421662+01:00" level=info msg="GET /containers/c618e64f04f5d9920119b70477d38d76b4761a1c2a8a92ce704e024d231c4dd1/stats?stream=false"

Origin of the message is https://github.com/influxdata/telegraf/blob/master/plugins/inputs/docker/docker.go#L119

I have no idea how to fix this issue, though. I have found no way to increase timeout value or anything related to such setting. It's possible, though, I haven't dug deep enough.

zstyblik on 23 Feb 2016

@zstyblik the docker plugin has a hardcoded timeout of 5s, https://github.com/influxdata/telegraf/blob/master/plugins/inputs/docker/docker.go#L108-L114 which I believe should be more than enough.

I think that should be a configuration setting, but that's a topic for another discussion.

tripledes on 23 Feb 2016

@tripledes unfortunately, this setting isn't related to the issue.

zstyblik on 23 Feb 2016

@zstyblik obviously, a closed pipe has nothing to do with a timeout, but you suggested to increase the timeout and I just provided information regarding it being hardcoded :grey_question:

tripledes on 23 Feb 2016

Seems like the closed pipe is a side effect of the timeout over docker socket. Looks to me like it might be some synchronisation issue in dockerClient, but just a guess.

On the other hand, I've been looking how docker stats works because it doesn't fail, or at least it doesn't report any issue. The difference is that they use https://github.com/docker/engine-api as a client library. If I get the time I'll try to do a POC just to see if I can reproduce the issue with engine-api.

tripledes on 23 Feb 2016

@sparrc should we keep this opened ? As the issue can be reproduced, I believe it should be opened until a fix is found.

tripledes on 24 Feb 2016

yep, sure

sparrc on 24 Feb 2016

First attempt to switch to Docker's engine-api, if anyone is willing to test it, it's here:

https://github.com/zooplus/telegraf/tree/docker_engine_api

Besides having better compatibility I think one of the advantages of using engine-api is that they use context for all request so they can handle failure better.

I'd be very glad to have some feedback, I tried to keep the output as it was before but the following items would need some love:

Better error handling (probably using a shared error channel amongst all goroutines)
Unit tests

And whether possible, I'd like to make the plugin a bit more flexible, using jsonflattener? So we don't need to specify all the metrics upfront. But I guess this could be left for follow-ups.

@sparrc what are your thoughts about the change? I think it could also be done with Go's std lib, but would require a bigger effort to have the same functionalities (context, api version compatibilities, ...).

tripledes on 25 Feb 2016

@tripledes I don't have time to test but this sounds fine with me.

There is also a PR up for improving some of the docker metrics: https://github.com/influxdata/telegraf/pull/754, how does that fit in?

sparrc on 26 Feb 2016

@sparrc I currently have an instance of Telegraf with my changes running on our test env, no issues for now, just some blkio metric names that I need to check...other than that it's running fine, still some feedback from anyone involved on this issue would be very much appreciated :) @asdfsx @adithyabenny

Regarding #754, I just had a quick look and I don't think it'd be an issue, I could reapply my changes on top of it once you get it merged.

tripledes on 26 Feb 2016

@tripledes if you make any change to the telegraf for this issue, I'd like to try

asdfsx on 29 Feb 2016

@asdfsx here: https://github.com/zooplus/telegraf/tree/docker_engine_api you'd need to compile it yourself. I could provide a compiled binary if needed.

tripledes on 29 Feb 2016

@tripledes I just compile it on ubuntu, and run it via the following command
sudo /home/ubuntu/go/bin/telegraf -config /home/ubuntu/telegraf/telegraf.conf -debug -test -filter docker
It seems ok right now.
centos seems ok too!
Anything else need to be tested? Please tell me!

asdfsx on 1 Mar 2016

@asdfsx thanks! Just let us know whether you find any issue so it can be fixed before submitting a PR.

tripledes on 1 Mar 2016

any changes considering this issue?

sporokh on 14 Mar 2016

@sporokh I understand you're also hitting the issue, right? I'd like to have a PR ready by the end of the week...although cannot really promise, little bit short on time this week, but I'll try.

tripledes on 15 Mar 2016

👍2

@tripledes Thanks a lot Sergio!
We have the same issue on our staging server, the metrics being collected but I consistently receive this error in my logs.

2016/02/23 08:42:21 Error getting docker stats: io: read/write on closed pipe

sporokh on 15 Mar 2016

@tripledes any possibilities of a PR by the end of this week?

sparrc on 30 Mar 2016

@sparrc sorry, little short on time lately, I'll try over the weekend. In case I don't manage to get the time I'll ping u back.

tripledes on 31 Mar 2016

@sparrc Just finished modifying the input, haven't done anything on the tests yet and just run a manual test, although looking promising.

I'll get to the tests tomorrow in the meantime anyone willing to test ?

https://github.com/tripledes/telegraf/tree/engine-api

Feedback welcome :+1:

tripledes on 3 Apr 2016

thank you @tripledes, this has worked well for me

sparrc on 6 Apr 2016

@sparrc glad to hear! I'd like to have a better look to the input plugin whenever I get a bit of time (quite busy lately at work), as I think it should be checking for API version and also to have some kind of integration tests against the supported docker api versions. Just some ideas.

tripledes on 6 Apr 2016

Check the syslogs (tail -f /var/log/syslog). If the error is

Error in plugin [inputs.docker]: Got permission denied while trying to connect to the Docker daemon...

then you have to add telegraf user to docker group, as explained here:

$ sudo usermod -aG docker telegraf

forzagreen on 4 Dec 2017

👍1

To anyone looking for a solution on ARM based architecture...

As root open the cmdline.txt file...
$ sudo nano /boot/firmware/cmdline.txt

Add the following to the end of the file...
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1

Reboot the system...
$ sudo reboot

Verify that the changes have worked!
$ docker stats

Hope this helps.

remiteeple on 9 Sep 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Config test with input.exec and sudo

Xiol · 3Comments

[inputs.exec] Python command ModuleNotFoundError

robert-gomes · 3Comments

How do I select from InfluxDB what hosts are in the DB?

mabushey · 3Comments

Input plugin for exim

fahimeh2010 · 3Comments

Add a prometheus pushgateway input module

IxDay · 3Comments