Hello,
I'm still getting these errors in the metricbeat logs with version 6.2.3 when it is deployed as a docker container:
2018-03-21T10:10:07.539Z ERROR instance/metrics.go:69 Error while getting memory usage: error retrieving process stats
2018-03-21T10:10:07.539Z ERROR instance/metrics.go:113 Error retrieving CPU percentages: error retrieving process stats
but there is no clue what the problem could be. I'm running an official metricbeat docker image and trying to pull stats from the host.
The metricbeat container is run using this command:
docker run -d --restart=always --name metricbeat \
--net=host \
-u root \
-v /proc:/hostfs/proc:ro \
-v /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro \
-v /:/hostfs:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
metricbeat:6.2.3 -system.hostfs=/hostfs
cat system.yml
- module: system
  period: 10s
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    #- core
    - diskio
    #- socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
    - drop_event.when.regexp:
        system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'
Any idea what could be the cause?
My initial guess is that it cannot find its own metrics in /hostfs/proc because it is PID namespaced, so it is looking for a small PID like 2, but on the host its PID is very different.
This problem stems from the fact that /hostfs is treated as a global variable, which causes all metrics-collecting code to read from /hostfs/proc. But the self-monitoring metrics should always come from /proc.
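For illustration, here is a minimal Go sketch (not Beats code) of the mismatch, assuming hostfs is mounted at /hostfs as in the docker run command above: the process looks up its namespaced PID under the host's proc, where that PID either doesn't exist or belongs to a different process.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Assumed hostfs mount point, matching the docker run command above.
	hostfs := "/hostfs"

	// Inside the container this is the PID in the container's own namespace
	// (often a small number like 1 or 2), not the PID the host sees.
	pid := os.Getpid()

	// Self-monitoring code that honours the global hostfs setting ends up
	// looking here, which is the wrong place for its own stats.
	dir := filepath.Join(hostfs, "proc", fmt.Sprint(pid))
	if _, err := os.Stat(dir); err != nil {
		fmt.Printf("cannot find matching process for pid=%d: %v\n", pid, err)
	} else {
		fmt.Printf("%s exists, but it belongs to whatever host process has PID %d\n", dir, pid)
	}
}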
@kvch Can you try to reproduce this? I think we need a test case for this, and we can discuss some possible solutions.
@Constantin07 Just so we are on the same page, I should clarify that those log messages indicate a problem with the self-monitoring feature added in 6.2, which lets the Beat report its own CPU/memory/load information in the log (the "Non-zero metrics in the last 30s" message) and to X-Pack Monitoring if configured.
Your regular metrics from the system module should not be affected.
@andrewkroh I can reproduce the problem.
We definitely need a new test case for this. I think the problem should be handled in gosigar. However, I am not yet sure how it should be done.
How about the fix in #6641?
It is enough to run metricbeat with the system.hostfs argument to reproduce the problem; it is not related to docker itself.
I think a general solution without modifying gosigar could be to obtain the pid of the process from <sigar.Procd>/self/status (the Pid field), instead of using os.Getpid(). This would work both with and without namespaces.
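Something like this minimal sketch, assuming the host's /proc is mounted at sigar.Procd (e.g. /hostfs/proc); the hostPid helper here is only illustrative, not the actual gosigar/Beats change:

package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// hostPid returns the caller's PID as seen in the PID namespace of the
// procfs mounted at procd, read from the Pid field of <procd>/self/status.
func hostPid(procd string) (int, error) {
	f, err := os.Open(filepath.Join(procd, "self", "status"))
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Pid:") {
			return strconv.Atoi(strings.TrimSpace(strings.TrimPrefix(line, "Pid:")))
		}
	}
	return 0, fmt.Errorf("Pid field not found in %s/self/status", procd)
}

func main() {
	// With -system.hostfs=/hostfs this would be /hostfs/proc; otherwise plain /proc.
	pid, err := hostPid("/proc")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("PID in the procfs mount's namespace:", pid)
}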
@andrewkroh yes, you are right. I do see the regular metrics in Elasticsearch, but the error log message seems misleading to me (it's kind of "it works", but not completely).
@jsoriano can you give more details about the general solution?
In the hostfs proc dir we have the host processes; /hostfs/proc/self/status will not be a metricbeat process.
I will try to check it myself later, but for now I don't understand how it will work.
@ewgRa the host proc dir contains all processes running in any PID namespace of the machine, including the namespace in which the metricbeat process runs. The special file self always refers to the process in the namespace of the procfs mount, so if metricbeat reads <sigar.Procd>/self/status (with the host /proc mounted in sigar.Procd) it will see its status in the host namespace, which includes its PID in that namespace; that PID could then be used in calls to gosigar without changing sigar.Procd.
I say it could be a general solution because it'd also work when no namespacing is used: sigar.Procd would be /proc and this would contain the process as usual.
The behaviour of self in different namespaces is documented in the pid_namespaces man page:
Calling readlink(2) on the path /proc/self yields the process ID of the caller in the PID namespace of the procfs mount (i.e., the PID namespace of the process that mounted the procfs). This can be useful for introspection purposes, when a process wants to discover its PID in other namespaces.
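This is easy to check with a minimal Go sketch (illustration only): reading the /proc/self symlink prints the caller's PID as seen by the namespace of that procfs mount.

package main

import (
	"fmt"
	"os"
)

func main() {
	// With the host's /proc bind-mounted (e.g. at /hostfs/proc, as in the docker
	// run command above), reading the symlink there would yield the container
	// process's PID as the host sees it; on a plain /proc it matches os.Getpid().
	target, err := os.Readlink("/proc/self")
	if err != nil {
		fmt.Println("readlink failed:", err)
		return
	}
	fmt.Println("/proc/self ->", target)
}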
@jsoriano thanks for the brilliant idea. I made the changes and it works, close to magic level :) Can you review it again? The failed CI doesn't look related to my changes.
I see only two problems/limitations with this solution:
It is hard to write a test for this case; I've added a simple test, but it doesn't actually exercise the real case itself.
If a wrong -system.hostfs flag is passed (for example, a directory that doesn't exist), this approach will fail to get metrics.
But I think these are acceptable edge cases.
Fixed by #6641
Thanks @ewgRa @jsoriano
Thanks
Is there a workaround for this?
I am running 6.2.4 and the issue still persists.
+1 still seeing:
2018-10-30T18:47:37.898Z ERROR instance/metrics.go:92 Error retrieving CPU percentages: error retrieving process stats: cannot find matching process for pid=1
2018-10-30T18:47:37.898Z ERROR instance/metrics.go:61 Error while getting memory usage: error retrieving process stats: cannot find matching process for pid=1

I added hostPid: true in accordance with: https://github.com/elastic/beats/issues/6734

Edit: I just realized that's not where it goes. I moved it in accordance with https://kubernetes.io/docs/concepts/policy/pod-security-policy/. Still no dice. We see the same errors in the logs as before.
@grantcurell what version of metricbeat are you using? Could you also share the configuration you are using to start it?
And what logs do you see with hostPid: true? They shouldn't be the same, as metricbeat won't have pid 1.
Update: It took me a second to get it in the right place, but Metricbeat is 6.2.4. The logs are clean now and I'm not receiving the error, but the behavior of the dashboard is erratic and I'm not sure why.
Update: Took me a second to get it in the right place, and I'm no longer receiving the error in the logs. What is strange is that the dashboard behaves very erratically and the data is incorrect. For example: I'm doing a controlled test where I'm pumping 5Gb/s into a security sensor I have, and I'm confirming with traditional monitoring tools that the sensor is receiving the expected 5Gb/s (4.65 to be exact, with the loss from overhead), but Metricbeat's reading for inbound traffic jumps around all over the place, anywhere from 17MB/s to 120MB/s, and it changes on each 5 second interval I have the dashboard set to.
The other problem I have can be seen below. If I set the time period to anything less than 30 minutes, the entire top part of the dashboard zeros out, but the accompanying data continues to display correctly, including network speed, which you can see is sitting at the expected 600MB/s (4800 Mb/s).
Additional Info: This is running on Kubernetes 1.9.7

You can see below that when I change the time to 30 minutes it displays, though the disk IO is still missing.
Edit: And the information is still inaccurate in general.

@grantcurell thanks for all the details.
6.2.4 didn't yet include the fix for this; you need 6.3.0 or later. In any case, the hostPid: true workaround should work.
Regarding the other problems, it'd be great if you could confirm them with a more modern metricbeat version and open specific issues.
Can I update metricbeat independently of Elasticsearch in this case?
If you are using Elasticsearch 6.X this should be fine; check the product compatibility matrix.
@jsoriano upgrading to Metricbeat 6.4.2 didn't fix the problem. I still get a bunch of strange partial data if the time interval is anything less than 30 minutes, e.g. on the Kubernetes dashboard, but moving it to 30 minutes brings the data back. (Screenshots omitted.)