Fluent-bit: Not showing elasticsearch error message before retrying

Created on 9 Jul 2019 · 10 Comments · Source: fluent/fluent-bit

Bug Report

Describe the bug
fluent-bit is receiving errors from Elasticsearch but is not warning the user. All we see is "new retry created for task_id".

To Reproduce
I think you can easily reproduce this by adding an Elasticsearch output that feeds a different "type" of entries into an existing index. From ES 6.x on, [multiple mapping types are not supported in indices created in 6.0](https://www.elastic.co/guide/en/elasticsearch/reference/6.0/breaking-changes-6.0.html), although using several types per index used to be [common practice](https://www.elastic.co/blog/index-vs-type). As such, ES will reject the records with:

{"took":13,"errors":true,"items":[{"index":{"_index":"mylog","_type":"syslog","_id":"asdfADGNsdfn2344n","status":400,"error":{"type":"illegal_argument_exception","reason":"Rejecting mapping update to [mylog] as the final mapping would have more than 1 type: [syslog, docker]"}}}]}

Note that the error refers to a record of _type=syslog, whereas another _type=docker was already being used in the cluster.
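
The rejection is only visible in the response body: the top-level `errors` flag and each item's `status` and `error` fields. As a minimal sketch of the kind of check being asked for (written in Python for illustration, not Fluent Bit's actual C code; `bulk_errors` is a hypothetical helper name), the per-item reasons could be pulled out like this:

```python
import json

# Sample body copied from the Elasticsearch _bulk response quoted above.
SAMPLE_RESPONSE = (
    '{"took":13,"errors":true,"items":[{"index":{"_index":"mylog",'
    '"_type":"syslog","_id":"asdfADGNsdfn2344n","status":400,'
    '"error":{"type":"illegal_argument_exception",'
    '"reason":"Rejecting mapping update to [mylog] as the final mapping '
    'would have more than 1 type: [syslog, docker]"}}}]}'
)


def bulk_errors(body):
    """Return one dict per failed item in a _bulk response body.

    Hypothetical helper for illustration; not part of Fluent Bit.
    """
    response = json.loads(body)
    if not response.get("errors"):
        return []  # the top-level flag says every item succeeded
    failures = []
    for item in response.get("items", []):
        # Each item has a single key naming the action (index, create, ...).
        for action, result in item.items():
            error = result.get("error")
            if error is not None:
                failures.append({
                    "action": action,
                    "index": result.get("_index"),
                    "status": result.get("status"),
                    "type": error.get("type"),
                    "reason": error.get("reason"),
                })
    return failures


for failure in bulk_errors(SAMPLE_RESPONSE):
    print(f"{failure['action']} on {failure['index']} failed "
          f"({failure['status']} {failure['type']}): {failure['reason']}")
```

Checking the `errors` flag first keeps the common success path cheap; the per-item loop only runs when something was rejected.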

However, fluent-bit only shows this to the user:

[0] syslog.udp: [1562582290.719936179, {"timefield"=>"Jul  8 10:36:17", "ident"=>....}]
[2019/07/08 10:38:10] [debug] [task] created task=0x7fd172a3ef00 id=0 OK
[2019/07/08 10:38:11] [debug] [out_es] HTTP Status=200 URI=/_bulk
[2019/07/08 10:38:11] [debug] [retry] new retry created for task_id=0 attemps=1
[2019/07/08 10:38:11] [debug] [sched] retry=0x7fd172a0b7c0 0 in 7 seconds

We only know that there was a problem because we see a retry, but we have no idea what the problem was. In other situations I've seen fluent-bit show the ES message; I don't know why it doesn't happen in this case.

Note: the output above was collected with Log_Level trace!

Expected behavior
Print ES response in case of error. In the special case of (at least) Log_Level trace, I'd just print it anyway, so that the user knows what's going on! :)

Your Environment

Version used: 1.2
Configuration: irrelevant
Environment name and version (e.g. Kubernetes? What version?): fluent-bit official docker image running on Kubernetes.

Source

ntavares

👍14

Most helpful comment

Hi, I can't remember what the problem was or how I fixed it, but this issue was more about the lack of a descriptive message explaining why it failed, not the particular (syntax?) problem that was causing the error.

ntavares on 22 Feb 2020

👍2

All 10 comments

Me too.

> [2019/12/16 17:53:40] [debug] [out_es] HTTP Status=400 URI=/_bulk
> [2019/12/16 17:53:40] [debug] [task] task_id=0 reached retry-attemps limit 2/1
> [2019/12/16 17:53:40] [ warn] [engine] Task cannot be retried: task_id=0 thread_id=2 output=es.0
> [2019/12/16 17:53:40] [debug] [task] destroy task=0x7f0b4ca48500 (task_id=0)

fauzan-n on 16 Dec 2019

You need to add the following lines to your "td-agent-bit.conf" file:

            tls    On
            tls.verify Off

This forces Fluent Bit to use HTTPS.

hlosukwakha on 17 Dec 2019

👎2

Any update? I'm facing the same situation.

ilanssari on 11 Feb 2020

@ntavares did you find the reason for the issue?

ilanssari on 12 Feb 2020

Hi there,

I am getting a similar message when trying to add an OUTPUT that forwards to Elasticsearch in our VPC using HTTP authentication. Any suggestions?

Feb 21 10:38:06 td-agent-bit: [4] cpu.local: [1582281486.001175326, {"cpu_p"=>11.000000, "user_p"=>8.500000, "system_p"=>2.500000, "cpu0.p_cpu"=>9.000000, "cpu0.p_user"=>7.000000, "cpu0.p_system"=>2.000000, "cpu1.p_cpu"=>14.000000, "cpu1.p_user"=>11.000000, "cpu1.p_system"=>3.000000}]
Feb 21 10:38:06 td-agent-bit: [2020/02/21 10:38:06] [debug] [out_es] HTTP Status=400 URI=/_bulk
Feb 21 10:38:06 td-agent-bit: [2020/02/21 10:38:06] [debug] [retry] new retry created for task_id=2 attemps=1
Feb 21 10:38:06 td-agent-bit: [2020/02/21 10:38:06] [ warn] [engine] failed to flush chunk '9843-1582281482.1391136.flb', retry in 8 seconds: task_id=2, input=cpu.0 > output=es.1

Here is our td-agent-bit config file:

```
[INPUT]
    Name cpu
    Tag cpu.local
    # Interval Sec
    # ====
    # Read interval (sec) Default: 1
    Interval_Sec 1

[OUTPUT]
    Name stdout
    Match *

[OUTPUT]
    Name es
    Match *
    Host vpcXXXXX.es.amazonaws.com
    Port 443
    index ec2-test-index
    Logstash_Format On
    Retry_limit 1
    Type _doc
    Replace_dots On
    Logstash_Prefix tf-res-test-ec2
    Time_Key @timestamp
    HTTP_user es-access-user
    HTTP_Passwd **
    tls "on"
    tls_verify on
    tls.debug 1
```

shivshankarb on 21 Feb 2020

Hello,
In my case the fix was just renaming the record field from "log" to "log_message", and it works fine. By the way, you can debug the error by adding another output (file) to capture the generated JSON and then sending that JSON manually to the ES server with a POST request (e.g. `curl -XPOST`).
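
The manual replay described above could be scripted roughly as follows. This is only a sketch under assumptions: the captured records are assumed to be available as plain JSON objects, the target is assumed to accept the standard `_bulk` NDJSON format, and the helper names `build_bulk_body` and `post_bulk`, the `http://localhost:9200` URL, and the `mylog` index are placeholders rather than anything taken from this thread.

```python
import json
import urllib.error
import urllib.request


def build_bulk_body(records, index, doc_type=None):
    """Build an NDJSON _bulk payload: one action line plus one source line per record.

    doc_type is optional; pass it only for clusters that still expect a
    mapping type in the action line.
    """
    lines = []
    for record in records:
        action = {"_index": index}
        if doc_type is not None:
            action["_type"] = doc_type
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(record))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def post_bulk(base_url, body):
    """POST the payload to <base_url>/_bulk and return (status, response text).

    HTTP error responses are returned instead of raised, so the
    Elasticsearch error message stays visible to the caller.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder record; replace with data captured from the file output.
    body = build_bulk_body([{"log_message": "hello"}], index="mylog")
    print(body, end="")
    # Uncomment to send it to a real cluster (placeholder URL):
    # status, text = post_bulk("http://localhost:9200", body)
    # print(status, text)
```

Because a bulk request can come back with status 200 and still contain rejected items, the returned text is worth reading even when the status looks fine.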

ilanssari on 21 Feb 2020

Thanks for the comment.
For reference, the cause of my HTTP Status=400 error above:
tls "on" was the culprit. ES was rejecting the request because of it.
I removed the quotes and it worked:
tls on
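
Going only by the report above (quotes around the `tls` value broke the request), a quick pre-flight check for quoted values in a classic-format config might look like the sketch below. The helper `find_quoted_values` and its regular expression are hypothetical, and whether keys other than `tls` are affected by quoting is not established in this thread.

```python
import re

# Matches a "key value" line whose whole value is wrapped in matching quotes.
QUOTED_VALUE = re.compile(r'^\s*([A-Za-z0-9_.]+)\s+(["\'])(.*)\2\s*$')


def find_quoted_values(config_text):
    """Return (line_number, key, value) for every key whose value is quoted."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = QUOTED_VALUE.match(line)
        if match:
            hits.append((number, match.group(1), match.group(3)))
    return hits


# Excerpt modelled on the [OUTPUT] section posted earlier in the thread.
sample = 'Name es\nMatch *\ntls "on"\ntls_verify on\n'
for number, key, value in find_quoted_values(sample):
    print(f"line {number}: {key} has a quoted value {value!r}")
```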

shivshankarb on 21 Feb 2020

👍1

ntavares on 22 Feb 2020

👍2

@edsiper being bitten again by this (lack of verbosity)... can we have some input from you?

ntavares on 25 Feb 2020

A bit more verbosity would be nice. Logging to a file and checking the error message from ES manually, as ntdetect mentioned, feels a bit dirty to me.