I start telegraf with the following config:
$ more /etc/telegraf/telegraf.d/docker.conf
[[inputs.docker]]
# Docker Endpoint
# To use TCP, set endpoint = "tcp://[ip]:[port]"
# To use environment variables (ie, docker-machine), set endpoint = "ENV"
endpoint = "unix:///var/run/docker.sock"
# Only collect metrics for these containers, collect all if empty
container_names = []
and start telegraf with the following command:
/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
but I cannot find any docker measurements in InfluxDB:
> show measurements;
name: measurements
------------------
name
cpu
disk
diskio
mem
swap
system
Actually, I can see docker data being collected by telegraf:
$/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -input-filter docker -test
* Plugin: docker, Collection 1
......
> docker_cpu,com.docker.compose.config-hash=2db93f17fb0fdbb2b3be408209d18ac7eb9f44d787af2df58b6f6601771763cf,com.docker.compose.container-number=1,com.docker.compose.oneoff=False,com.docker.compose.project=grafana,com.docker.compose.service=grafana,com.docker.compose.version=1.5.1,cont_id=528cfa640ba2863df3febd0cd28b173527599b8c2d81a26c6965fc3b13b0ea2d,cont_image=grafana/grafana,cont_name=grafana_grafana_1,cpu=cpu1 usage_total=8040078470i 1454561313911620608
......
I finally fixed this problem!
@asdfsx what was the issue/solution?
@sparrc Nothing but restarting telegraf several times; then the data appeared in InfluxDB... but I think some problems still remain.
Right now I'm trying to use Grafana to display the metrics, but I can only see one point on the graph, not a line.

I'm confused by this.
And when I execute this query:
SELECT mean(usage_total) FROM docker_cpu WHERE time > now() - 1h AND host = 'mesos36' GROUP BY time(15m)
I only get one record.

It's strange: when I start telegraf by running the command:
/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -debug
I can get docker data in influxdb.
Then I create a graph in Grafana using a query like the one below:
SELECT mean("usage_total") FROM "docker_cpu" WHERE "host" = 'mesos36' AND $timeFilter GROUP BY time($interval) fill(null)
I get a graph like the one below.

It seems docker_cpu.usage_total keeps growing, which doesn't seem right.
And when I start telegraf using systemctl start telegraf, it seems no docker data is sent to InfluxDB.
SELECT mean(usage_total) FROM docker_cpu WHERE host = 'mesos36' AND time > now() - 1h GROUP BY time(5m)
docker_cpu
time mean
2016-02-04T07:25:00Z
2016-02-04T07:30:00Z
2016-02-04T07:35:00Z
2016-02-04T07:40:00Z 9535640409061.729
2016-02-04T07:45:00Z
2016-02-04T07:50:00Z
2016-02-04T07:55:00Z 9541001545054.232
2016-02-04T08:00:00Z 9542674282644.1
2016-02-04T08:05:00Z 9544248412986.191
2016-02-04T08:10:00Z
2016-02-04T08:15:00Z
2016-02-04T08:20:00Z
2016-02-04T08:25:00Z
Might be a permissions issue; try:
sudo -u telegraf /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d -debug
The upward slope is normal. Docker's CPU "usage" metric is actually just a counter of CPU ticks used.
I think you are right!
I just noticed that telegraf is running under the telegraf account.
So I modified /etc/systemd/system/telegraf.service:
[Service]
EnvironmentFile=-/etc/default/telegraf
#User=telegraf
User=root
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d ${TELEGRAF_OPTS}
Restart=on-failure
KillMode=process
You can see that telegraf was started by the user telegraf before.
And now I can see the new data!
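One step worth noting after editing the unit file (my aside, not from the original comment): systemd only picks up unit-file changes after a reload, so something like this is needed before the new User= setting takes effect:

```shell
# reload unit files so the edited telegraf.service is picked up,
# then restart the service so it runs under the new user
sudo systemctl daemon-reload
sudo systemctl restart telegraf
```

Alternatively, instead of running the service as root, adding the telegraf user to the docker group also grants access to the socket.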
BTW, you said that Docker's CPU "usage" metric is actually just a counter of CPU ticks used.
Does that mean I don't need to use the mean function?
Can I just query the data like this:
SELECT "usage_total" FROM "docker_cpu" WHERE "host" = 'mesos36'
or do I need some other function, like count or sum?
yes, that query would be fine
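Since usage_total is a cumulative counter, the raw values will always slope upward; to graph an actual CPU usage rate, a derivative is the usual approach. A sketch of such a query (untested here, assuming InfluxQL's derivative() function is available in your InfluxDB version):

```sql
-- per-second rate of change of the cumulative CPU-ticks counter
SELECT derivative(mean("usage_total"), 1s) FROM "docker_cpu"
WHERE "host" = 'mesos36' AND $timeFilter
GROUP BY time($interval) fill(null)
```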
sudo su telegraf -c '/usr/bin/telegraf -config telegraf.conf -test -filter docker'
works fine. However, the service fails to send the docker metrics and the log fills with multiple instances of
Error getting docker stats: io: read/write on closed pipe
The default permission on the unix socket is 660 (UID:root, GID:docker), and I've added user telegraf to the docker group as well. @sparrc Any idea what's going wrong?
@sparrc I'm seeing the same issue with v0.10.3-1:
Feb 23 09:42:21 dev112-12 docker[879]: 2016/02/23 08:42:21 Error getting docker stats: io: read/write on closed pipe
Feb 23 09:42:25 dev112-12 docker[879]: 2016/02/23 08:42:25 Error getting docker stats: read unix @->/var/run/docker.sock: i/o timeout
Feb 23 09:42:25 dev112-12 docker[879]: 2016/02/23 08:42:25 Error getting docker stats: read unix @->/var/run/docker.sock: i/o timeout
However, docker ps works just fine.
I think this is most likely related to the Docker version, as I'm not seeing it on hosts with Docker version 1.8.3, build f4bf5c7, but I am seeing it on Docker version 1.9.1, build 4419fdb-dirty.
I've written a small app, which is a scaled-down version of the Telegraf plugin, and I'm getting the same error even on a host with Docker v1.8.3. However, I can see the requests being made in the logs:
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.054869742+01:00" level=info msg="GET /containers/json"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.056860514+01:00" level=info msg="GET /containers/6b864d4d17e370abeff82dc0bb6553905f161fc2ec3b8b2e5998ee9bd637f166/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.057260630+01:00" level=info msg="GET /containers/616487a45616594a2ca671bd0a6f5691cd71fc2c7eee7dfd85cd6f4d6949e0f1/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.057576478+01:00" level=info msg="GET /containers/6242042c2f252ab5225f0173090cf37dedda8c18cf2de5f28ed52ce57c22d69c/stats?stream=false"
Feb 23 11:27:30 builder docker[10341]: time="2016-02-23T11:27:30.058421662+01:00" level=info msg="GET /containers/c618e64f04f5d9920119b70477d38d76b4761a1c2a8a92ce704e024d231c4dd1/stats?stream=false"
Origin of the message is https://github.com/influxdata/telegraf/blob/master/plugins/inputs/docker/docker.go#L119
I have no idea how to fix this issue, though. I've found no way to increase the timeout value or anything related to such a setting, although it's possible I haven't dug deep enough.
@zstyblik the docker plugin has a hardcoded timeout of 5s, https://github.com/influxdata/telegraf/blob/master/plugins/inputs/docker/docker.go#L108-L114 which I believe should be more than enough.
I think that should be a configuration setting, but that's a topic for another discussion.
@tripledes unfortunately, this setting isn't related to the issue.
@zstyblik obviously, a closed pipe has nothing to do with a timeout, but you suggested to increase the timeout and I just provided information regarding it being hardcoded :grey_question:
Seems like the closed pipe is a side effect of the timeout over docker socket. Looks to me like it might be some synchronisation issue in dockerClient, but just a guess.
On the other hand, I've been looking at how docker stats works, because it doesn't fail, or at least it doesn't report any issue. The difference is that it uses https://github.com/docker/engine-api as the client library. If I get the time I'll try to do a POC just to see if I can reproduce the issue with engine-api.
@sparrc should we keep this open? As the issue can be reproduced, I believe it should stay open until a fix is found.
yep, sure
First attempt to switch to Docker's engine-api, if anyone is willing to test it, it's here:
https://github.com/zooplus/telegraf/tree/docker_engine_api
Besides better compatibility, I think one of the advantages of using engine-api is that it uses a context for every request, so failures can be handled better.
I'd be very glad to have some feedback. I tried to keep the output as it was before, but the following items would need some love:
And where possible, I'd like to make the plugin a bit more flexible, perhaps using a JSON flattener, so we don't need to specify all the metrics upfront. But I guess that could be left for follow-ups.
@sparrc what are your thoughts on the change? I think it could also be done with Go's standard library, but it would require a bigger effort to get the same functionality (context, API version compatibility, ...).
@tripledes I don't have time to test but this sounds fine with me.
There is also a PR up for improving some of the docker metrics: https://github.com/influxdata/telegraf/pull/754, how does that fit in?
@sparrc I currently have an instance of Telegraf with my changes running on our test env; no issues for now, just some blkio metric names that I need to check. Other than that it's running fine. Still, feedback from anyone involved in this issue would be very much appreciated :) @asdfsx @adithyabenny
Regarding #754, I just had a quick look and I don't think it'd be an issue; I could reapply my changes on top of it once it gets merged.
@tripledes if you make any changes to telegraf for this issue, I'd like to try them
@asdfsx here: https://github.com/zooplus/telegraf/tree/docker_engine_api. You'd need to compile it yourself; I could provide a compiled binary if needed.
@tripledes I just compiled it on Ubuntu and ran it via the following command:
sudo /home/ubuntu/go/bin/telegraf -config /home/ubuntu/telegraf/telegraf.conf -debug -test -filter docker
It seems OK right now.
CentOS seems OK too!
Anything else that needs to be tested? Please tell me!
@asdfsx thanks! Just let us know if you find any issues so they can be fixed before submitting a PR.
Any updates on this issue?
@sporokh I understand you're also hitting the issue, right? I'd like to have a PR ready by the end of the week...although cannot really promise, little bit short on time this week, but I'll try.
@tripledes Thanks a lot Sergio!
We have the same issue on our staging server: the metrics are being collected, but I consistently receive this error in my logs:
2016/02/23 08:42:21 Error getting docker stats: io: read/write on closed pipe
@tripledes any possibilities of a PR by the end of this week?
@sparrc sorry, I've been a little short on time lately; I'll try over the weekend. In case I don't manage to find the time, I'll ping you back.
@sparrc Just finished modifying the input; I haven't done anything on the tests yet and have only run a manual test, although it's looking promising.
I'll get to the tests tomorrow. In the meantime, anyone willing to test?
https://github.com/tripledes/telegraf/tree/engine-api
Feedback welcome :+1:
thank you @tripledes, this has worked well for me
@sparrc glad to hear it! I'd like to have a better look at the input plugin whenever I get a bit of time (quite busy at work lately), as I think it should check the API version and also have some kind of integration tests against the supported Docker API versions. Just some ideas.
Check the syslog (tail -f /var/log/syslog). If the error is
Error in plugin [inputs.docker]: Got permission denied while trying to connect to the Docker daemon...
then you have to add the telegraf user to the docker group:
$ sudo usermod -aG docker telegraf
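One caveat worth adding (my note, not from the original comment): group membership is only picked up by newly started processes, so the service needs a restart afterwards:

```shell
# restart so the telegraf process picks up its new docker group membership
sudo systemctl restart telegraf
```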
To anyone looking for a solution on ARM-based architectures...
As root open the cmdline.txt file...
$ sudo nano /boot/firmware/cmdline.txt
Add the following to the end of the file...
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
Reboot the system...
$ sudo reboot
Verify that the changes have worked!
$ docker stats
Hope this helps.