We have some issues with metricbeat-to-logstash transport. After some time (it can be days after startup), metricbeat starts logging "Failed to publish events: unsupported float value: NaN" every report cycle and retransmits the same event forever. That leads to billions of copies of a single event being delivered to different logstash instances.
Activated metricbeat modules:
system (cpu, load, core, filesystem, fsstat, memory, network, process, socket)
docker (container, cpu, diskio, healthcheck, image, info, memory, network)
elasticsearch (node, node_stats)
logstash (node, node_stats)
http (json)
redis (info, keyspace)
Our logstash transport is encrypted with TLS.
Versions:
Oracle Linux Server release 7.6
metricbeat 6.5.4 (running in a docker container)
logstash 6.5.4 (running in a docker container)
https://discuss.elastic.co/t/failed-to-publish-events-unsupported-float-value-nan/169771
The metric that produces this error is "docker / memory".
I found a way to "reproduce" the behavior: if you restart a docker container over and over again, the NaN values will eventually occur, but it can take a while. I suspect this only happens when the module tries to fetch the metric while the container is going up or down.
Same issue here (metricbeat 6.7.0):
2019-03-29T15:46:05.254+0800 ERROR elasticsearch/client.go:391 Failed to encode event: unsupported float value: NaN
2019-03-29T15:46:05.254+0800 ERROR elasticsearch/client.go:391 Failed to encode event: unsupported float value: NaN
2019-03-29T15:46:05.255+0800 ERROR elasticsearch/client.go:391 Failed to encode event: unsupported float value: NaN
2019-03-29T15:46:05.255+0800 ERROR elasticsearch/client.go:391 Failed to encode event: unsupported float value: NaN
2019-03-29T15:46:05.255+0800 ERROR elasticsearch/client.go:391 Failed to encode event: unsupported float value: NaN
2019-03-29T15:46:05.255+0800 ERROR elasticsearch/client.go:391 Failed to encode event: unsupported float value: NaN
2019-03-29T15:46:05.255+0800 ERROR elasticsearch/client.go:391 Failed to encode event: unsupported float value: NaN
Thank you for opening this, @lusche. Could you please enable debugging and report back? I'm trying to find out where the issue is coming from.
You can enable debug logging of what is being published; that way we may see which field is the offending one. To do so, please run Metricbeat with the following flag: -d publish.
Once you reproduce the issue, the failing events should be in the logs for analysis. Please review them before posting them here to avoid leaking any sensitive data.
I was able to reproduce this by restarting a redis container in a loop. Metricbeat is the current master build.
Apr 05 10:48:57 griffon.nest metricbeat[25578]: 2019-04-05T10:48:57.934-0500 ERROR elasticsearch/client.go:393 Failed to encode event: unsupported float value: NaN
With the debug logger on, we also got this message:
Apr 05 10:48:57 griffon.nest metricbeat[25578]: 2019-04-05T10:48:57.934-0500 DEBUG [elasticsearch] elasticsearch/client.go:394 Failed event: &{2019-04-05 10:48:56.914486893 -0500 CDT m=+373.037949509 null Not valid json: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value <nil>}
Compiled with better debugging, got this:
Apr 05 11:06:42 griffon.nest metricbeat[785]: 2019-04-05T11:06:42.322-0500 DEBUG [elasticsearch] elasticsearch/client.go:395 Failed event: &beat.Event{Timestamp:time.Time{wall:0xbf21fd23e156f69f, ext:64106699084, loc:(*time.Location)(0x3fe6880)}, Meta:common.MapStr(nil), Fields:common.MapStr{"host":common.MapStr{"os":common.MapStr{"kernel":"4.20.16-200.fc29.x86_64", "platform":"fedora", "version":"29 (Server Edition)", "family":"redhat", "name":"Fedora"}, "id":"065a459cdc604e2dba66ec6f6a9a8136", "containerized":false, "hostname":"griffon.nest", "name":"griffon.nest", "architecture":"x86_64"}, "agent":common.MapStr{"version":"8.0.0", "type":"metricbeat", "ephemeral_id":"d3301b0f-c533-4d7c-a96e-68ae476722cd", "hostname":"griffon.nest", "id":"d867dd8d-e193-497b-9519-f1ef5624c3c1"}, "container":common.MapStr{"image":common.MapStr{"name":"redis"}, "runtime":"docker", "id":"eed32ff6bfc0519b3284bd4773a8b006edc012e2110d06a2c5a5b79cbca978fa", "name":"focused_mirzakhani"}, "event":common.MapStr{"dataset":"docker.memory", "module":"docker", "duration":1761909056}, "metricset":common.MapStr{"name":"memory"}, "service":common.MapStr{"address":"/var/run/docker.sock", "type":"docker"}, "docker":common.MapStr{"memory":common.MapStr{"fail":common.MapStr{"count":0x0}, "limit":0x0, "rss":common.MapStr{"total":0x0, "pct":NaN}, "usage":common.MapStr{"pct":NaN, "max":0x0, "total":0x0}}}, "ecs":common.MapStr{"version":"1.0.0"}}, Private:interface {}(nil)}
The bad data is here:
"rss":common.MapStr{"total":0x0, "pct":NaN}, "usage":common.MapStr{"pct":NaN, "max":0x0, "total":0x0}
Which is kind of what I expected, as we do some division math to get those values.
https://github.com/elastic/beats/blob/ebdf66d8ef6299a9c9a7633badd14ed73e87d501/metricbeat/module/docker/memory/helper.go#L57
I played around a bit, and docker stats can be kind of weird if you run it while a container is restarting. I was able to get it to report a 0 value for Limit, which is the denominator of the division above:
[root@griffon metricbeat]# while true; do docker stats --no-stream; done
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
-- 0.00% 0B / 0B 0.00% 0B / 0B 0B / 0B 0
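To make the failure mode concrete, here is a minimal Go sketch (a hypothetical helper, not the actual beats code linked above) showing how a zero limit reported by Docker turns the usage percentage into NaN, and how a simple guard avoids emitting it:

```go
package main

import (
	"fmt"
	"math"
)

// memoryUsagePct is a hypothetical helper mirroring the kind of division the
// docker memory metricset does (usage / limit); it is not the actual beats code.
func memoryUsagePct(usage, limit uint64) float64 {
	if limit == 0 {
		// A container that is starting or stopping can report a zero limit;
		// returning 0 avoids NaN (0/0) and +Inf (usage/0), neither of which
		// can be encoded as JSON.
		return 0
	}
	return float64(usage) / float64(limit)
}

func main() {
	fmt.Println(memoryUsagePct(1048576, 4194304)) // 0.25
	fmt.Println(memoryUsagePct(0, 0))             // 0 instead of NaN

	var usage, limit float64               // both zero, as reported for a restarting container
	fmt.Println(math.IsNaN(usage / limit)) // true: the unguarded division yields NaN
}
```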
Thanks for all the effort to fix this problem.
I'd like to suggest a further investigation of the general behavior. It's not the first time a metric couldn't be processed, and I'm sure it won't be the last, which in general is fine; nobody can account for every corner case.
I think there is a "second" issue in the Beat code. If the module isn't able to convert the event, it will fail on every future try, so putting it back and reprocessing the whole batch (which sends the healthy events over and over again) is a problem. It would be great if someone could have a look at this.
Also, it would be great if the message Failed to encode event: unsupported float value: NaN contained more information about the reason or cause of the error.
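Purely as a sketch of what such extra context could look like (this helper does not exist in beats), one could walk the event before or after a failed encode and log which fields hold NaN alongside the error:

```go
package main

import (
	"fmt"
	"math"
)

// findNaNFields is a hypothetical helper (not part of beats) that walks a
// nested event and returns the paths of all NaN values, so an encode error
// could be logged together with the offending field names.
func findNaNFields(prefix string, m map[string]interface{}) []string {
	var paths []string
	for key, value := range m {
		path := key
		if prefix != "" {
			path = prefix + "." + key
		}
		switch v := value.(type) {
		case map[string]interface{}:
			paths = append(paths, findNaNFields(path, v)...)
		case float64:
			if math.IsNaN(v) {
				paths = append(paths, path)
			}
		}
	}
	return paths
}

func main() {
	event := map[string]interface{}{
		"docker": map[string]interface{}{
			"memory": map[string]interface{}{
				"usage": map[string]interface{}{"pct": math.NaN(), "total": 0.0},
				"rss":   map[string]interface{}{"pct": math.NaN()},
			},
		},
	}
	fmt.Println(findNaNFields("", event))
	// e.g. [docker.memory.usage.pct docker.memory.rss.pct] (map order may vary)
}
```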
@lusche
The PR I put in should be a more general fix, where we got events from docker before a container was started, leading to some weird behavior, including the NaNs.
I think there is a "second" issue in the Beat code. If the module isn't able to convert the event, it will fail on every future try, so putting it back and reprocessing the whole batch (which sends the healthy events over and over again) is a problem. It would be great if someone could have a look at this.
Do you have any logs or other data relating to this? It would be worth opening a second issue for it.
@lusche @fearful-symmetry there is an issue to improve the general handling of these values https://github.com/elastic/beats/issues/9212
@fearful-symmetry
In the related discussion https://discuss.elastic.co/t/failed-to-publish-events-unsupported-float-value-nan/169771/4?u=sgerlach I posted some screenshots from our kibana showing a significant spike in events. When we analyzed these events, we found that only a handful of events were being retransmitted over and over again.
Our “Setup” looks like this:
Beat -> Logstash in -> Kafka -> Logstash out -> Elasticsearch
When we receive events in “Logstash in” we keep the original timestamp as @timestamp and add a separate timestamp indicating when logstash received the event. The same happens in “Logstash out”.
In the analyzed data we could see the exact same @timestamp for the events, but the other two timestamps kept moving forward. Also, the log message with the error occurs over and over again. That’s an indication that the “broken” event didn’t get dropped.
@jsoriano
Quote from #9212:
Normally the JSON encoding will fail and the event is dropped in the output. An error message is logged in this case.
Unfortunately, in our experience the event didn’t get dropped. Dropping it would be the desired behavior, and it would fix our pain with the NaN values.
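For reference, the encode error itself is unavoidable: JSON has no representation for NaN, so any encoder has to either reject the event or sanitize the value first. A minimal illustration with the Go standard library (Beats uses its own encoder, but the constraint is the same):

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

func main() {
	// A cut-down event carrying a NaN, like the docker.memory one above.
	event := map[string]interface{}{
		"docker": map[string]interface{}{
			"memory": map[string]interface{}{
				"usage": map[string]interface{}{"pct": math.NaN()},
			},
		},
	}

	_, err := json.Marshal(event)
	fmt.Println(err) // json: unsupported value: NaN
}
```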
@lusche It looks like the kibana screenshots you posted are from a different module entirely? I would open a separate issue for this.
@fearful-symmetry
Yes, the metrics are from another module, but the other module isn’t the problem. I’m willing to open a new issue, but first I’d like to be sure we are talking about the same thing.
English isn’t my native tongue, so maybe I’m not able to convey what I’m trying to illustrate. Also, I’m not a software developer or a Go expert, so I might be getting something wrong.
My understanding of the metricbeat / custom Beat framework is that there is a core part which provides functionality like reading the configuration and transporting messages to the configured outputs.
In metricbeat there are several modules which collect data and place it in something like a “queue”. The core takes a chunk of this data from the “queue”, transforms it to JSON and pushes it to the configured output.
That should describe, in the abstract, how metricbeat works?
Let’s now assume that the “submodules” which collect the data run at nearly the same time and that they have different runtimes. I would also assume that the results are not necessarily in order in the “queue”, so a module “A” metric can be followed by a module “B” metric and so on.
If we now take a chunk or batch from the “queue” which contains the “broken” data and try to encode it, we get the error message. Parts of this chunk/batch can be processed, so they get transmitted to elasticsearch or whatever output is defined. But the “broken” data doesn’t get dropped. Worse, it looks like the whole chunk/batch, or at least part of it, gets retried over and over again, which leads to endless error messages in the log and endless duplicates of events in elasticsearch.
That should explain why you see “unrelated” events in the screenshot: it’s not the event in the screenshot that is broken. To me it looks like the error handling in the transform-to-JSON step is broken.
As mentioned in the linked issue, the author assumes that “broken” data gets dropped. Instead, I think the “broken” data produces a loop that “kills” this instance of metricbeat and generates a very large amount of data. We managed to produce 1 TB of events in a single day while one of our metricbeat instances was stuck in this loop.
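To illustrate the behavior described above, here is a hypothetical sketch (not the actual libbeat pipeline code) of an output loop that retries the whole batch whenever any event in it fails to encode; the healthy events are resent on every attempt while the broken one is never dropped:

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// publishBatch is a hypothetical output loop: if any event in the batch fails
// to encode, the whole batch is put back and retried, so healthy events are
// sent again and again while the broken one is never dropped.
func publishBatch(batch []map[string]interface{}, send func([]byte)) {
	for attempt := 1; attempt <= 3; attempt++ { // bounded here; effectively unbounded in the behavior described above
		failed := false
		for _, event := range batch {
			data, err := json.Marshal(event)
			if err != nil {
				fmt.Printf("attempt %d: failed to encode event: %v\n", attempt, err)
				failed = true
				continue
			}
			send(data) // healthy events are (re)sent on every attempt
		}
		if !failed {
			return
		}
		// The broken event is not dropped: the whole batch is retried.
	}
}

func main() {
	batch := []map[string]interface{}{
		{"metric": "healthy", "value": 1.0},
		{"metric": "broken", "value": math.NaN()},
	}
	sent := 0
	publishBatch(batch, func([]byte) { sent++ })
	fmt.Println("healthy event sent", sent, "times") // 3 times for a single event
}
```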
@lusche I think you are right. I have been doing some tests and it seems that, at least with the logstash output, events with encoding errors cause misbehaviour: whole batches are retried and/or dropped.
Could you try to reproduce this issue with the Kafka and/or Elasticsearch outputs?
With the Elasticsearch output it doesn't happen.
@lusche I have created a different issue for this https://github.com/elastic/beats/issues/11732
Thanks for your report and detailed explanation.
@jsoriano thanks for the new issue. We like your solution for metrics and logs and we use it in our infrastructure, so I'm happy if I can give something back, even if it's only a report about odd behavior.
At this time our setup is beats -> logstash -> ..., so it would take a while to reproduce this with other outputs, but it looks like you did it yourself. If I can provide more help or information regarding this issue, let me know.
The same happens with the prometheus module (v6.7.2).
Edit: already tracked in #10849