Beats: Filebeat S3 input plugin cannot parse jsonl file with content-type set as application/json

Created on 14 Jul 2020  路  6Comments  路  Source: elastic/beats

Hey! I had created this topic two weeks ago https://discuss.elastic.co/t/filebeat-s3-cannot-parse-jsonl-file-whos-content-type-is-set-to-application-json/239374 but no one has responded and I am convinced that this is a bug or something that should be fixed so I thought I'd try making an issue. I appreciate any feedback and sorry if this is just extra noise for y'all.

The problem is that the filebeat S3 input plugin cannot process an s3 object who's content-type is application/json AND the object content is a separate json object per line (i.e. jsonl). Processing such an object used to be possible until v7.7.0 when the S3 input plugin started enforcing json parsing if it saw a content-type of application/json: https://github.com/elastic/beats/blob/v7.7.0/x-pack/filebeat/input/s3/input.go#L432
Before that, json processing was ONLY controlled by the expand_event_list_from_field configuration like you can see here: https://github.com/elastic/beats/blob/v7.6.2/x-pack/filebeat/input/s3/input.go#L431

Now, it's probably the case that content-type on the s3 object should NOT be application/json in the first place but sadly I do not have control over that :frowning:. I'm essentially dealing with the same problem as https://github.com/elastic/beats/issues/18696 but cloudflare is the entity pushing the logs to s3 (where for him it is AWS GuardDuty) and I don't have control over how cloudflare sets content-type.

I had reproduced this issue by compiling filebeat off of master two weeks ago (here was the filbeat version output filebeat version 8.0.0 (amd64), libbeat 8.0.0 [3341c1bca5626d1ee90af692617f10f58695ed1c built 2020-06-30 20:15:28 +0000 UTC]) and, looking at the source code today (2297636b9f159526a77a29561df1a97d7bb93a18), it seems this problem is still there. Here were the steps I followed when I reproduced this issue:

$ cat s3filebeat.log 
{"id": "0001", "hey": "there", "how": {"are": "you"}}
{"id": "0002", "hope": "you", "are": {"doing": "well"}}
{"id": "0003", "I": "am", "doing": {"O": "K"}}

$ gzip s3filebeat.log 

$ aws --profile PROFILE s3api put-object --body ./s3filebeat.log.gz --bucket lucas-test-filebeat-s3 --content-encoding gzip --content-type application/json --key s3filebeat.log.gz
{
    "ETag": "\"955ed9f01b6ee38dbba167daab9ebbbb\""
}

$ cat filebeat.yml 
filebeat.inputs:
- type: s3
  queue_url: "https://sqs.us-east-1.amazonaws.com/ACCTNUM/cloudflare_logs_dev"
  role_arn: "arn:aws:iam::ACCTNUM:role/cloudflare_logs_filebeat_access_s3_sqs_dev"

output.console:
  pretty: true

$ ./filebeat -e
2020-06-30T15:39:04.180-0500    INFO    instance/beat.go:628    Home path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Config path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Data path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data] Logs path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs]
2020-06-30T15:39:04.180-0500    INFO    instance/beat.go:636    Beat ID: 2ef98d6a-ef7c-4885-a258-b4341a63b43b
2020-06-30T15:39:04.181-0500    INFO    [seccomp]   seccomp/seccomp.go:124  Syscall filter successfully installed
2020-06-30T15:39:04.181-0500    INFO    [beat]  instance/beat.go:964    Beat info   {"system_info": {"beat": {"path": {"config": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "data": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data", "home": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "logs": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs"}, "type": "filebeat", "uuid": "2ef98d6a-ef7c-4885-a258-b4341a63b43b"}}}
2020-06-30T15:39:04.181-0500    INFO    [beat]  instance/beat.go:973    Build info  {"system_info": {"build": {"commit": "3341c1bca5626d1ee90af692617f10f58695ed1c", "libbeat": "8.0.0", "time": "2020-06-30T20:15:28.000Z", "version": "8.0.0"}}}
2020-06-30T15:39:04.181-0500    INFO    [beat]  instance/beat.go:976    Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":12,"version":"go1.13.3"}}}
2020-06-30T15:39:04.182-0500    INFO    [beat]  instance/beat.go:980    Host info   {"system_info": {"host": {"architecture":"x86_64","boot_time":"2020-06-22T10:53:41-05:00","containerized":false,"name":"lgroenendaal-XPS-15-9570","ip":["127.0.0.1/8","::1/128","192.168.86.24/24","fe80::c38c:a654:ca3e:d0dd/64","10.20.211.109/32","fe80::d2ee:dd77:2943:aa45/64","172.17.0.1/16"],"kernel_version":"5.3.0-59-generic","mac":["9c:b6:d0:c6:01:39","02:42:e2:23:9b:84"],"os":{"family":"debian","platform":"ubuntu","name":"Ubuntu","version":"18.04.2 LTS (Bionic Beaver)","major":18,"minor":4,"patch":2,"codename":"bionic"},"timezone":"CDT","timezone_offset_sec":-18000,"id":"5a8843fa712d481595ebd41926cda45f"}}}
2020-06-30T15:39:04.183-0500    INFO    [beat]  instance/beat.go:1009   Process info    {"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":null,"effective":null,"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "exe": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/filebeat", "name": "filebeat", "pid": 5578, "ppid": 4156, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2020-06-30T15:39:03.640-0500"}}}
2020-06-30T15:39:04.183-0500    INFO    instance/beat.go:298    Setup Beat: filebeat; Version: 8.0.0
2020-06-30T15:39:04.183-0500    INFO    [publisher] pipeline/module.go:113  Beat name: lgroenendaal-XPS-15-9570
2020-06-30T15:39:04.184-0500    WARN    beater/filebeat.go:151  Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500    INFO    [monitoring]    log/log.go:118  Starting metrics logging every 30s
2020-06-30T15:39:04.184-0500    INFO    instance/beat.go:449    filebeat start running.
2020-06-30T15:39:04.184-0500    WARN    beater/filebeat.go:251  Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500    INFO    registrar/registrar.go:145  Loading registrar data from /home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data/registry/filebeat/data.json
2020-06-30T15:39:04.184-0500    INFO    registrar/registrar.go:152  States Loaded from registrar: 0
2020-06-30T15:39:04.184-0500    INFO    [crawler]   beater/crawler.go:71    Loading Inputs: 1
2020-06-30T15:39:04.185-0500    WARN    [cfgwarn]   s3/input.go:131 BETA: s3 input type is used
2020-06-30T15:39:04.185-0500    INFO    [crawler]   beater/crawler.go:141   Starting input (ID: 18222034013403473169)
2020-06-30T15:39:04.185-0500    INFO    [crawler]   beater/crawler.go:108   Loading and starting Inputs completed. Enabled inputs: 1
2020-06-30T15:39:04.186-0500    INFO    [s3]    s3/input.go:173 visibility timeout is set to 300 seconds
2020-06-30T15:39:04.186-0500    INFO    [s3]    s3/input.go:174 aws api timeout is set to 2m0s
2020-06-30T15:39:04.186-0500    INFO    [s3]    s3/input.go:196 s3 input worker has started. with queueURL: https://sqs.us-east-1.amazonaws.com/374144443638/cloudflare_logs_dev
2020-06-30T15:39:14.633-0500    ERROR   [s3]    s3/input.go:458 expand_event_list_from_field parameter is missing in config for application/json content-type file
2020-06-30T15:39:14.633-0500    ERROR   [s3]    s3/input.go:393 createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file

The imporant error is the last line which tells me that my object cannot be processed even though I think it should be processable: s3/input.go:393 createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file.

If I try, just for fun, to include the expand_event_list_from_field config it will, understandably, fail to parse and we'll get the WARN log: s3/input.go:542 decode json failed for 's3filebeat.log.gz' from S3 bucket 'lucas-test-filebeat-s3', skipping this file: json: cannot unmarshal string into Go value of type []interface {}.

Any advice on how to proceed here? Personally I think the appropriate change would be to ignore content-type: application/json and have expand_event_list_from_field be the ONLY thing controlling whether or not to parse the object content as JSON (which is how the logic used to be) but I defer to you maintainers because you have the vision for how you want the code to behave.

I really appreciate any who took the time to read this and I really apprecate the creation of this feature, it's a nifty one. Cheers.

Platforms

All 6 comments

Pinging @elastic/integrations-platforms (Team:Platforms)

Hi @lag13 sorry we missed the discuss issue! This does look like a bug to me so thank you for creating this issue. One question: What does AWS Cloudflare set for the content-type?

That's fine! I'm glad you saw this one :). Cloudflare (and it's just cloudflare, not "AWS cloudflare" https://www.cloudflare.com/) sets the content-type to application/json for the S3 objects it creates. Not sure if should be setting it to that (since the file is a separate json object on each line) but I have no control over it unfortunately and I feel like, regardless, the s3 input plugin should be at least capable of processing such an entity.

I'm curious as to why the code was changed in 7.7.0 to force json parsing if the content-type was application/json (https://github.com/elastic/beats/blob/v7.7.0/x-pack/filebeat/input/s3/input.go#L432) where before it was only controlled by the presence of the expand_event_list_from_field config: https://github.com/elastic/beats/blob/v7.6.2/x-pack/filebeat/input/s3/input.go#L431. Seems like that change is strictly worse since all it does is limit behavior but I defer to y'all on what you think is best!

Currently I'm using version 7.5.2 of filebeat to process things and that's working just fine. Thanks for getting back to me on this. Let me know if there's anything I can do to help. I'd be happy to attempt a contribution if you are swamped with other things.

@lag13 I just created an initial PR for fixing this issue. Do you mind sending me a real log file from Cloudflare for testing? Please feel free to review/test it. Thanks!

@kaiyan-sheng Thanks so much!!! I think what you have in your PR works just fine. The important bit is those separate json objects per line. That being said, the cloudflare logs looks like this:

{"CacheCacheStatus":"unknown","CacheResponseBytes":1963,"CacheResponseStatus":200,"CacheTieredFill":false,"ClientASN":14618,"ClientCountry":"us","ClientDeviceType":"desktop","ClientIP":"1.2.3.4","ClientIPClass":"noRecord","ClientRequestBytes":2400,"ClientRequestHost":"test.com","ClientRequestMethod":"POST","ClientRequestPath":"/hey/there","ClientRequestProtocol":"HTTP/1.1","ClientRequestReferer":"","ClientRequestURI":"/hey/there","ClientRequestUserAgent":"Apache-HttpClient/4.5.8 (Java/1.8.0_201)","ClientSSLCipher":"something-cool","ClientSSLProtocol":"TLSv1.2","ClientSrcPort":47463,"ClientXRequestedWith":"","EdgeColoCode":"IAD","EdgeColoID":16,"EdgeEndTimestamp":"2020-06-25T01:25:12Z","EdgePathingOp":"wl","EdgePathingSrc":"macro","EdgePathingStatus":"nr","EdgeRateLimitAction":"","EdgeRateLimitID":0,"EdgeRequestHost":"test.com","EdgeResponseBytes":972,"EdgeResponseCompressionRatio":0,"EdgeResponseContentType":"application/json","EdgeResponseStatus":200,"EdgeServerIP":"5.6.7.8","EdgeStartTimestamp":"2020-06-25T01:25:12Z","FirewallMatchesActions":[],"FirewallMatchesRuleIDs":[],"FirewallMatchesSources":[],"OriginIP":"1.2.3.4","OriginResponseBytes":0,"OriginResponseHTTPExpires":"","OriginResponseHTTPLastModified":"","OriginResponseStatus":200,"OriginResponseTime":18000000,"OriginSSLProtocol":"TLSv1.2","ParentRayID":"00","RayID":"5a8ae4af0e520d86","SecurityLevel":"high","WAFAction":"unknown","WAFFlags":"0","WAFMatchedVar":"","WAFProfile":"unknown","WAFRuleID":"","WAFRuleMessage":"","WorkerCPUTime":0,"WorkerStatus":"unknown","WorkerSubrequest":false,"WorkerSubrequestCount":0,"ZoneID":143880799}
{"CacheCacheStatus":"unknown","CacheResponseBytes":1963,"CacheResponseStatus":200,"CacheTieredFill":false,"ClientASN":14618,"ClientCountry":"us","ClientDeviceType":"desktop","ClientIP":"1.2.3.4","ClientIPClass":"noRecord","ClientRequestBytes":2400,"ClientRequestHost":"test.com","ClientRequestMethod":"POST","ClientRequestPath":"/hey/buddy","ClientRequestProtocol":"HTTP/1.1","ClientRequestReferer":"","ClientRequestURI":"/hey/buddy","ClientRequestUserAgent":"Apache-HttpClient/4.5.8 (Java/1.8.0_201)","ClientSSLCipher":"something-cool","ClientSSLProtocol":"TLSv1.2","ClientSrcPort":47463,"ClientXRequestedWith":"","EdgeColoCode":"IAD","EdgeColoID":16,"EdgeEndTimestamp":"2020-06-25T01:25:12Z","EdgePathingOp":"wl","EdgePathingSrc":"macro","EdgePathingStatus":"nr","EdgeRateLimitAction":"","EdgeRateLimitID":0,"EdgeRequestHost":"test.com","EdgeResponseBytes":972,"EdgeResponseCompressionRatio":0,"EdgeResponseContentType":"application/json","EdgeResponseStatus":200,"EdgeServerIP":"5.6.7.8","EdgeStartTimestamp":"2020-06-25T01:25:12Z","FirewallMatchesActions":[],"FirewallMatchesRuleIDs":[],"FirewallMatchesSources":[],"OriginIP":"1.2.3.4","OriginResponseBytes":0,"OriginResponseHTTPExpires":"","OriginResponseHTTPLastModified":"","OriginResponseStatus":200,"OriginResponseTime":18000000,"OriginSSLProtocol":"TLSv1.2","ParentRayID":"00","RayID":"5a8ae4af0e520d86","SecurityLevel":"high","WAFAction":"unknown","WAFFlags":"0","WAFMatchedVar":"","WAFProfile":"unknown","WAFRuleID":"","WAFRuleMessage":"","WorkerCPUTime":0,"WorkerStatus":"unknown","WorkerSubrequest":false,"WorkerSubrequestCount":0,"ZoneID":143880799}

.... etc....

I took a quick look at your PR. I'll try to test it out myself before tomorrow.

Was this page helpful?
0 / 5 - 0 ratings