Hello,
I'm still getting these errors in the metricbeat logs with version 6.2.3 when it is deployed as a docker container:
2018-03-21T10:10:07.539Z ERROR instance/metrics.go:69 Error while getting memory usage: error retrieving process stats
2018-03-21T10:10:07.539Z ERROR instance/metrics.go:113 Error retrieving CPU percentages: error retrieving process stats
but there is no clue what the problem could be. I'm running an official metricbeat docker image and trying to pull stats from the host.
The metricbeat container is run using this command:
docker run -d --restart=always --name metricbeat \
--net=host \
-u root \
-v /proc:/hostfs/proc:ro \
-v /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro \
-v /:/hostfs:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
metricbeat:6.2.3 -system.hostfs=/hostfs
cat system.yml
- module: system
  period: 10s
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    #- core
    - diskio
    #- socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
    - drop_event.when.regexp:
        system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'
Any idea what could be the cause?
My initial guess is that it cannot find its own metrics in /hostfs/proc because it is PID namespaced, so it is looking for a small PID like 2, but on the host its PID is very different.
This problem stems from the fact that /hostfs is treated as a global variable, which causes all metrics-collecting code to read from /hostfs/proc. But the self-monitoring metrics should always come from /proc.
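For illustration, here is a minimal Go sketch (not Beats code) of the mismatch, assuming hostfs is mounted at /hostfs as in the docker run command above: the process looks up its namespaced PID under the host's proc, where that PID either doesn't exist or belongs to a different process.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Assumed hostfs mount point, matching the docker run command above.
	hostfs := "/hostfs"

	// Inside the container this is the PID in the container's own namespace
	// (often a small number like 1 or 2), not the PID the host sees.
	pid := os.Getpid()

	// Self-monitoring code that honours the global hostfs setting ends up
	// looking here, which is the wrong place for its own stats.
	dir := filepath.Join(hostfs, "proc", fmt.Sprint(pid))
	if _, err := os.Stat(dir); err != nil {
		fmt.Printf("cannot find matching process for pid=%d: %v\n", pid, err)
	} else {
		fmt.Printf("%s exists, but it belongs to whatever host process has PID %d\n", dir, pid)
	}
}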
@kvch Can you try to reproduce this? I think we need a test case for this, and we can discuss some possible solutions.
@Constantin07 Just so we are on the same page, I should clarify that those log messages indicate a problem with the self-monitoring feature added in 6.2, which lets the Beat report its own CPU/memory/load information in the log (the "Non-zero metrics in the last 30s" message) and to X-Pack Monitoring if configured.
Your regular metrics from the system module should not be affected.
@andrewkroh I can reproduce the problem.
We definitely need a new test case for this. I think the problem should be handled in gosigar. However, I am not yet sure how it should be done.
How about the fix in #6641?
It is enough to run metricbeat with the system.hostfs argument to reproduce the problem; it is not related to docker itself.
I think a general solution without modifying gosigar could be to obtain the pid of the process from <sigar.Procd>/self/status (the Pid field), instead of using os.Getpid(). This would work both with and without namespaces.
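Something like this minimal sketch, assuming the host's /proc is mounted at sigar.Procd (e.g. /hostfs/proc); the hostPid helper here is only illustrative, not the actual gosigar/Beats change:

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// hostPid returns the caller's PID as seen in the PID namespace of the
// procfs mounted at procd, read from the Pid field of <procd>/self/status.
func hostPid(procd string) (int, error) {
	f, err := os.Open(filepath.Join(procd, "self", "status"))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Pid:") {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Pid:")))
		}
	}
	return 0, fmt.Errorf("Pid field not found in %s/self/status", procd)
}

func main() {
	// With -system.hostfs=/hostfs this would be /hostfs/proc; otherwise plain /proc.
	pid, err := hostPid("/proc")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("PID in the procfs mount's namespace:", pid)
}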
@andrewkroh yes, you are right. I do see the regular metrics in Elasticsearch, but the error log message seems misleading to me (it's kind of "it works", but not completely).
@jsoriano can you give more details about the general solution?
In the hostfs proc dir we have the host processes; /hostfs/proc/self/status will not be a metricbeat process.
I will try to check it myself later, but for now I don't understand how it will work.
@ewgRa the host proc dir contains all processes running in any PID namespace of the machine, including the namespace in which the metricbeat process runs. The special file self always refers to the process in the namespace of the procfs mount, so if metricbeat reads <sigar.Procd>/self/status (with the host /proc mounted in sigar.Procd) it will see its status in the host namespace, which includes its PID in that namespace; that PID could then be used in calls to gosigar without changing sigar.Procd.
I say it could be a general solution because it'd also work when no namespacing is used: sigar.Procd would be /proc and this would contain the process as usual.
The behaviour of self in different namespaces is documented in the pid_namespaces man page:
Calling readlink(2) on the path /proc/self yields the process ID of the caller in the PID namespace of the procfs mount (i.e., the PID namespace of the process that mounted the procfs). This can be useful for introspection purposes, when a process wants to discover its PID in other namespaces.
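This is easy to check with a minimal Go sketch (illustration only): reading the /proc/self symlink prints the caller's PID as seen by the namespace of that procfs mount.

package main

import (
	"fmt"
	"os"
)

func main() {
	// With the host's /proc bind-mounted (e.g. at /hostfs/proc, as in the docker
	// run command above), reading the symlink there would yield the container
	// process's PID as the host sees it; on a plain /proc it matches os.Getpid().
	target, err := os.Readlink("/proc/self")
	if err != nil {
		fmt.Println("readlink failed:", err)
		return
	}
	fmt.Println("/proc/self ->", target)
}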
@jsoriano thanks for the brilliant idea. I made the changes and it works, close to magic level :) Can you review it again? The failed CI doesn't look related to my changes.
I see only two problems/limitations with this solution:
It is hard to write a test for this case; I've added a simple test, but it doesn't actually exercise the real case itself.
If a wrong -system.hostfs flag is passed (for example, a directory that doesn't exist), this approach will fail to get metrics.
But I think these are acceptable edge cases.
Fixed by #6641
Thanks @ewgRa @jsoriano
Thanks
Is there a workaround for this?
I am running 6.2.4 and the issue still persists.
+1 still seeing:
2018-10-30T18:47:37.898Z ERROR instance/metrics.go:92 Error retrieving CPU percentages: error retrieving process stats: cannot find matching process for pid=1
2018-10-30T18:47:37.898Z ERROR instance/metrics.go:61 Error while getting memory usage: error retrieving process stats: cannot find matching process for pid=1

I added hostPid: true in accordance with: https://github.com/elastic/beats/issues/6734

Edit: I just realized that's not where it goes. I moved it in accordance with https://kubernetes.io/docs/concepts/policy/pod-security-policy/. Still no dice. We see the same errors in the logs as before.
@grantcurell what version of metricbeat are you using? Could you also share the configuration you are using to start it?
And what logs do you see with hostPid: true? They shouldn't be the same, as metricbeat won't have pid 1.
Update: It took me a second to get it in the right place, but Metricbeat is 6.2.4. The logs are clean now and I'm not receiving the error, but the behavior of the dashboard is erratic and I'm not sure why.
Update: Took me a second to get it in the right place, and I'm no longer receiving the error in the logs. What is strange is that the dashboard behaves very erratically and the data is incorrect. For example: I'm doing a controlled test where I'm pumping 5Gb/s into a security sensor I have, and I'm confirming with traditional monitoring tools that the sensor is receiving the expected 5Gb/s (4.65 to be exact, with the loss from overhead), but Metricbeat's reading for inbound traffic jumps around all over the place, anywhere from 17MB/s to 120MB/s, and it changes on each 5 second interval I have the dashboard set to.
The other problem I have can be seen below. If I set the time period to anything less than 30 minutes, the entire top part of the dashboard zeros out, but the accompanying data continues to display correctly, including network speed, which you can see is sitting at the expected 600MB/s (4800 Mb/s).
Additional Info: This is running on Kubernetes 1.9.7

You can see below that when I change the time to 30 minutes it displays, though the disk IO is still missing.
Edit: And the information is still inaccurate in general.

@grantcurell thanks for all the details.
6.2.4 didn't yet include the fix for this; you need 6.3.0 or later. In any case, the hostPid: true workaround should work.
Regarding the other problems, it'd be great if you could confirm them with a more modern metricbeat version and open specific issues.
Can I update metricbeat independently of Elasticsearch in this case?
If you are using Elasticsearch 6.X this should be fine; check the product compatibility matrix.
@jsoriano upgrading to Metricbeat 6.4.2 didn't fix the problem. I still get a bunch of strange partial data if the time interval is anything less than 30 minutes, e.g. on the Kubernetes dashboard, but moving it to 30 minutes brings the data back. (Screenshots omitted.)