Describe the bug
I'm running my services in AWS Fargate with a Datadog agent sidecar and a fluent-bit sidecar that forwards logs to the Datadog agent, as prescribed by Datadog.
This used to work fine, but in the last few days I have seen occasional crashes of the fluent-bit container with exit code 139, which subsequently bring down my service as well.
I'm using 906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:latest, so I assume the problem was introduced in a recent version.
To Reproduce
This gist contains the task definition that had the issue. The crash happens about once a day, at random times.
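The gist itself is not reproduced here. For context, the sketch below shows the general shape of such a FireLens-based Datadog setup; every name and value is an illustrative placeholder, not the actual task definition from the gist. Datadog's documented setup usually also sets "TLS": "on"; the port-80 lines in the logs below suggest TLS was not enabled in this case.

```json
{
  "containerDefinitions": [
    {
      "name": "app",
      "image": "<your-service-image>",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "datadog",
          "Host": "http-intake.logs.datadoghq.com",
          "apikey": "<DD_API_KEY>",
          "dd_service": "app",
          "dd_source": "app",
          "provider": "ecs"
        }
      }
    },
    {
      "name": "log_router",
      "image": "906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:latest",
      "essential": true,
      "firelensConfiguration": { "type": "fluentbit" }
    }
  ]
}
```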
Screenshots
Logs before the crash (newest first):
[2020/07/23 06:51:07] [error] [src/flb_http_client.c:1077 errno=32] Broken pipe
[2020/07/23 06:51:07] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:80 (http_do=-1)
[engine] caught signal (SIGSEGV)
[2020/07/23 06:50:37] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:50:02] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:37] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:32] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:27] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:22] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:14] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:14] [ info] [engine] flush chunk '1-1595412363.546270069.flb' succeeded at retry 1: task_id=1, input=forward.0 > output=datadog.1
[2020/07/23 06:46:12] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:46:07] [error] [src/flb_http_client.c:1077 errno=32] Broken pipe
[2020/07/23 06:46:07] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:80 (http_do=-1)
[2020/07/23 06:46:07] [ warn] [engine] failed to flush chunk '1-1595412363.546270069.flb', retry in 7 seconds: task_id=0, input=forward.0 > output=datadog.1
[2020/07/23 06:45:37] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:45:02] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/07/23 06:41:37] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
Your Environment
@dotanrs Be aware of: https://github.com/fluent/fluent-bit/issues/2379
That issue was fixed in Fluent Bit 1.5.2, which has not yet made it into AWS for Fluent Bit. It doesn't explain the broken pipe errors, but the memory leak would cause Fluent Bit's memory usage to grow over time until the container was eventually killed in Fargate (see also the version-pinning note after this comment).
@dotanrs Can you try running the same task definition on an ECS EC2 cluster, replacing your Fluent Bit image with fluent/fluent-bit:1.5.2?
Why: unfortunately, it is possible you have run into two different issues at once: the memory leak above (fixed in 1.5.2), and a problem specific to running on Fargate, which is why I suggest trying EC2.
If you run on ECS EC2 with Fluent Bit 1.5.2 and the issue persists, then we know we have a separate bug unrelated to the other two. Apologies for the inconvenience caused by both of these.
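A side note on the :latest tag mentioned above: pinning an explicit release in the task definition makes it much easier to tie a crash to a specific Fluent Bit version. A minimal sketch (the tag is a placeholder):

```json
"image": "906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:<pinned-version>"
```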
@edsiper I created an "AWS" label for any issue affecting an AWS plugin or an AWS user, which will help me find them and triage/respond from the AWS side.
@edsiper Another report of what looks like the same crash: https://github.com/aws/aws-for-fluent-bit/issues/63
The image 906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:2.6.1 uses Fluent Bit 1.5.2.
I'll try it and see how it goes.
(Unfortunately I can't change my deployment type to EC2)
Even just for a one-off test?
@PettitWesley I don't currently run anything on EC2, so it's a lot of hassle for me to set it up.
So far no crashes with the new image 🤞
The problem persists with 906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:2.6.1 :(
Again, the fluent-bit container exited with status code 139.
Logs:
[2020/08/18 13:33:40] [ info] [output:datadog:datadog.1] http://http-intake.logs.datadoghq.com, port=80, HTTP status=200 payload={}
[2020/08/18 13:34:10] [error] [src/flb_http_client.c:1077 errno=32] Broken pipe
[engine] caught signal (SIGSEGV)
[2020/08/18 13:34:10] [error] [output:datadog:datadog.1] could not flush records to http-intake.logs.datadoghq.com:80 (http_do=-1)
@dotanrs Ok, luckily DataDog gave me a test account for this integration... I will try my best to see if I can reproduce and diagnose this issue.
@dotanrs So far nothing... my DataDog FireLens task has been running for a day and it's fine. I'll keep it going and check every few days though.
I'm not sure what to do to try triggering this. Does your app log at a high rate? My fake testing app only emits one message per second.
@PettitWesley I'm not writing that many logs, but it does come in bursts.
How much memory did you give it? Did you use configs like in the gist I shared?
I have about 30+ services and only ~1 crash per day.
If the assumption is that the issue is some kind of memory leak, it would be interesting to see the memory usage graph for your dummy service. Is it flat?
I'd also try decreasing the memory for the containers and increasing the log rate to try to aggravate it (see the sketch after this comment).
Thanks again for the help!
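One way to probe the memory-leak theory, sketched under the assumption that container-level limits behave as usual on Fargate: give the log router a small hard memory limit so that any growth hits the ceiling quickly (the numbers below are arbitrary):

```json
{
  "name": "log_router",
  "image": "906394416424.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:2.6.1",
  "essential": true,
  "memoryReservation": 50,
  "memory": 100,
  "firelensConfiguration": { "type": "fluentbit" }
}
```

One caveat: a leak-driven OOM kill would normally surface as exit code 137 (SIGKILL), not the 139 (SIGSEGV) reported here, so this test mainly helps distinguish the two failure modes.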
We think we have made some progress in debugging this issue. Please see the suggestion here: https://github.com/aws/aws-for-fluent-bit/issues/66#issuecomment-684371904
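The linked comment is the authoritative suggestion and is not quoted here. Purely as an assumption based on the port-80 broken pipes in the logs above (not necessarily what the linked comment says), the kind of change involved would be switching the Datadog output to TLS, e.g. in the FireLens options:

```json
"options": {
  "Name": "datadog",
  "Host": "http-intake.logs.datadoghq.com",
  "TLS": "on",
  "apikey": "<DD_API_KEY>",
  "provider": "ecs"
}
```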
@PettitWesley this works for me 🙂
Thanks!