Describe the bug
I'm running my services in AWS Fargate with a Datadog agent sidecar and a fluent-bit sidecar that forwards logs to the Datadog agent, as prescribed by Datadog.
This used to work fine, but in the last few days I have seen occasional crashes of the fluent-bit container with exit code 139, which subsequently bring down my service as well.
I'm using 906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:latest, so I assume the problem was introduced in a recent version.
To Reproduce
This gist contains the task definition that had the issue. The crash happens about once a day, at random times.
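The gist itself is not reproduced here. For context, the sketch below shows the general shape of such a FireLens-based Datadog setup; every name and value is an illustrative placeholder, not the actual task definition from the gist. Datadog's documented setup usually also sets "TLS": "on"; the port-80 lines in the logs below suggest TLS was not enabled in this case.

```json
{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "<your-service-image>",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "datadog",
          "Host": "http-intake.logs.datadoghq.com",
          "apikey": "<DD_API_KEY>",
          "dd_service": "app",
          "dd_source": "app",
          "provider": "ecs"
        }
      }
    },
    {
      "name": "log_router",
      "image": "906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:latest",
      "essential": true,
      "firelensConfiguration": { "type": "fluentbit" }
    }
  ]
}
```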
Screenshots
Logs before the crash (newest first):
[2020/07/23 06:51:07] [error] [src/flb_http_client.c:1077 errno=32] Broken pipe
[2020/07/23 06:51:07] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:80 (http_do=-1)
[engine] caught signal (SIGSEGV)
[2020/07/23 06:50:37] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:50:02] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:37] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:32] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:27] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:22] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:14] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:14] [ info] [engine] flush chunk '1-1595412363.546270069.flb' succeeded at retry 1: task_id=1, input=forward.0 > output=datadog.1
[2020/07/23 06:46:12] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:07] [error] [src/flb_http_client.c:1077 errno=32] Broken pipe
[2020/07/23 06:46:07] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:80 (http_do=-1)
[2020/07/23 06:46:07] [ warn] [engine] failed to flush chunk '1-1595412363.546270069.flb', retry in 7 seconds: task_id=0, input=forward.0 > output=datadog.1
[2020/07/23 06:45:37] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:45:02] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:41:37] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
Your Environment
@dotanrs Be aware of: https://github.com/fluent/fluent-bit/issues/2379
That issue was fixed in Fluent Bit 1.5.2, which has not yet made it into AWS for Fluent Bit. It doesn't explain the broken pipe errors, but the memory leak would cause Fluent Bit's memory usage to grow over time until the container was eventually killed in Fargate (see also the version-pinning note after this comment).
@dotanrs Can you try running the same task definition on an ECS EC2 cluster, replacing your Fluent Bit image with fluent/fluent-bit:1.5.2?
Why: unfortunately, it is possible you have run into two different issues at once: the memory leak above (fixed in 1.5.2), and a problem specific to running on Fargate, which is why I suggest trying EC2.
If you run on ECS EC2 with Fluent Bit 1.5.2 and the issue persists, then we know we have a separate bug unrelated to the other two. Apologies for the inconvenience caused by both of these.
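A side note on the :latest tag mentioned above: pinning an explicit release in the task definition makes it much easier to tie a crash to a specific Fluent Bit version. A minimal sketch (the tag is a placeholder):

```json
"image": "906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:<pinned-version>"
```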
@edsiper I created an "AWS" label for any issue affecting an AWS plugin or an AWS user, which will help me find them and triage/respond from the AWS side.
@edsiper Another report of what looks like the same crash: https://github.com/aws/aws-for-fluent-bit/issues/63
The image 906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:2.6.1 uses Fluent Bit 1.5.2.
I'll try it and see how it goes.
(Unfortunately I can't change my deployment type to EC2)
Even just for a one-off test?
@PettitWesley I don't currently run anything on EC2, so it's a lot of hassle for me to set it up.
So far no crashes with the new image 🤞
The problem persists with 906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:2.6.1 :(
Again, the fluent-bit container exited with status code 139.
Logs:
[2020/08/18 13:33:40] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/08/18 13:34:10] [error] [src/flb_http_client.c:1077 errno=32] Broken pipe
[engine] caught signal (SIGSEGV)
[2020/08/18 13:34:10] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:80 (http_do=-1)
@dotanrs Ok, luckily DataDog gave me a test account for this integration... I will try my best to see if I can reproduce and diagnose this issue.
@dotanrs So far nothing... my DataDog FireLens task has been running for a day and it's fine. I'll keep it going and check every few days though.
I'm not sure what to do to try triggering this. Does your app log at a high rate? My fake testing app only emits one message per second.
@PettitWesley I'm not writing that many logs, but it does come in bursts.
How much memory did you give it? Did you use configs like in the gist I shared?
I have about 30+ services and only ~1 crash per day.
If the assumption is that the issue is some kind of memory leak, it would be interesting to see the memory usage graph for your dummy service. Is it flat?
I'd also try decreasing the memory for the containers and increasing the log rate to try to aggravate it (see the sketch after this comment).
Thanks again for the help!
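One way to probe the memory-leak theory, sketched under the assumption that container-level limits behave as usual on Fargate: give the log router a small hard memory limit so that any growth hits the ceiling quickly (the numbers below are arbitrary):

```json
{
  "name": "log_router",
  "image": "906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:2.6.1",
  "essential": true,
  "memoryReservation": 50,
  "memory": 100,
  "firelensConfiguration": { "type": "fluentbit" }
}
```

One caveat: a leak-driven OOM kill would normally surface as exit code 137 (SIGKILL), not the 139 (SIGSEGV) reported here, so this test mainly helps distinguish the two failure modes.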
We think we have made some progress in debugging this issue. Please see the suggestion here: https://github.com/aws/aws-for-fluent-bit/issues/66#issuecomment-684371904
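The linked comment is the authoritative suggestion and is not quoted here. Purely as an assumption based on the port-80 broken pipes in the logs above (not necessarily what the linked comment says), the kind of change involved would be switching the Datadog output to TLS, e.g. in the FireLens options:

```json
"options": {
  "Name": "datadog",
  "Host": "http-intake.logs.datadoghq.com",
  "TLS": "on",
  "apikey": "<DD_API_KEY>",
  "provider": "ecs"
}
```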
@PettitWesley this works for me 🙂
Thanks!