Beats: [Functionbeat] Support S3 for log ingestion

Created on 2 Apr 2019 · 10 Comments · Source: elastic/beats

There are some AWS services that deliver logs to S3. It would be great to let users ingest data directly from S3. AWS Lambda can be set up to trigger on "new file" events from S3 buckets. functionbeat would then need to pull the file from S3. We could potentially stream the data directly from the S3 bucket to avoid keeping it all in memory or on disk.
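For illustration only (not an actual Functionbeat implementation), here is a rough sketch of what such a Lambda could look like in Go, using aws-lambda-go and aws-sdk-go: it is triggered by S3 "object created" events and streams each object line by line instead of buffering the whole file.

```go
// Sketch: Lambda handler triggered by S3 "ObjectCreated" events that
// streams each new object line by line (hypothetical, not Functionbeat code).
package main

import (
	"bufio"
	"context"
	"fmt"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func handler(ctx context.Context, evt events.S3Event) error {
	svc := s3.New(session.Must(session.NewSession()))
	for _, rec := range evt.Records {
		out, err := svc.GetObjectWithContext(ctx, &s3.GetObjectInput{
			Bucket: aws.String(rec.S3.Bucket.Name),
			Key:    aws.String(rec.S3.Object.Key),
		})
		if err != nil {
			return err
		}
		// Stream the body; each line would become one event, so the whole
		// file never has to sit in memory or on disk.
		scanner := bufio.NewScanner(out.Body)
		for scanner.Scan() {
			fmt.Println(scanner.Text()) // placeholder for publishing to the output
		}
		out.Body.Close()
		if err := scanner.Err(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```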

Labels: Functionbeat, Services, enhancement

All 10 comments

@roncohen thanks for filing. Are these raw logs or are they zipped up in some fashion? Also, what type of scale (EPS) are we expecting here? My understanding is parallel processing from S3 may require a buffer like SQS to make it possible, similar to what we do in LS with the S3 SQS input.
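Not Beats code, but for context: routing a bucket's "object created" notifications into an SQS queue (so multiple consumers can share the work) is a one-time bucket configuration. A rough sketch with aws-sdk-go, where the bucket name and queue ARN are placeholders:

```go
// Sketch: point S3 "object created" notifications at an SQS queue so that
// several workers can buffer and share the load. Bucket and ARN are made up.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	svc := s3.New(session.Must(session.NewSession()))
	_, err := svc.PutBucketNotificationConfiguration(&s3.PutBucketNotificationConfigurationInput{
		Bucket: aws.String("my-log-bucket"),
		NotificationConfiguration: &s3.NotificationConfiguration{
			QueueConfigurations: []*s3.QueueConfiguration{{
				QueueArn: aws.String("arn:aws:sqs:us-east-1:123456789012:my-log-queue"),
				Events:   []*string{aws.String("s3:ObjectCreated:*")},
			}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```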

/cc @ph @kvch for awareness

This could also make sense as an input for Filebeat; probably both can coexist. WDYT?

@exekias yes I would agree that it could make sense as a Filebeat input as well. Modules could probably also be shared, for instance Cloudwatch logs/metrics modules could similarly live in Functionbeat. This would offer the flexibility of serverless vs self-deploy ingest options.

Whether Functionbeat is the right fit always depends on the scale we are considering.

Yes, exactly what @acchen97 said. Concerning Filebeat, I think we should provide an S3 input that supports the following use cases:

  • Single ingester scenario: define an S3 input with a target bucket that scans the bucket for new files. This works well if you have only a single Filebeat; it's an easy getting-started path and useful for reading old buckets.
  • Provide an SQS-based solution, either built on autodiscover or on a different strategy in the input.

The problem I see with autodiscover is that it's fire-and-forget (it only starts/stops inputs). When you use SQS as the synchronization mechanism, you have to confirm that the message was processed, or signal that you are still working on it (e.g. by extending its visibility timeout). This is to ensure that acking events from an SQS queue preserves the durability guarantee; in the S3 context, it means we have completely parsed the file.
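To illustrate the ack/visibility handling described above (not actual Beats code), here is a rough sketch of such a consumer loop with aws-sdk-go; the queue URL and processFile are placeholders:

```go
// Sketch of the SQS synchronization described above: the message is only
// deleted (acked) once the referenced S3 file has been fully parsed, and its
// visibility timeout is extended while work is still in progress.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

const queueURL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-log-queue"

// processFile is a placeholder: download and parse the S3 object the message points to.
func processFile(body string) error { return nil }

func main() {
	svc := sqs.New(session.Must(session.NewSession()))
	for {
		out, err := svc.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(queueURL),
			MaxNumberOfMessages: aws.Int64(1),
			WaitTimeSeconds:     aws.Int64(20),
		})
		if err != nil {
			log.Fatal(err)
		}
		for _, msg := range out.Messages {
			// Tell SQS we are still working on this message so it is not
			// redelivered to another worker in the meantime.
			svc.ChangeMessageVisibility(&sqs.ChangeMessageVisibilityInput{
				QueueUrl:          aws.String(queueURL),
				ReceiptHandle:     msg.ReceiptHandle,
				VisibilityTimeout: aws.Int64(300),
			})
			if err := processFile(aws.StringValue(msg.Body)); err != nil {
				log.Printf("parse failed, message will reappear: %v", err)
				continue
			}
			// Ack: only delete once the file has been completely parsed.
			svc.DeleteMessage(&sqs.DeleteMessageInput{
				QueueUrl:      aws.String(queueURL),
				ReceiptHandle: msg.ReceiptHandle,
			})
		}
	}
}
```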

which services specifically write to S3 only?
We're looking into this now, as to which services can ONLY write to an S3 bucket.

Since there are many files in a bucket, the way I've understood people design this solution is to have a Beat listen for an event so that it can work on a file exclusively while consuming it. That way, you could then add auto-scaling capabilities as those servers (or services) become heavily utilized.

which services specifically write to S3 only?
We're looking into this now, as to which services can ONLY write to an S3 bucket.

ALB and S3 access logs, for example.

As an update here, the S3 input made it into Filebeat; there are some sharing opportunities here: https://www.elastic.co/guide/en/beats/filebeat/master/filebeat-input-s3.html
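For reference, a minimal configuration sketch for that input, based on the linked docs (the queue URL is a placeholder; see the documentation for the full option list):

```yaml
filebeat.inputs:
- type: s3
  queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/my-log-queue
```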

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Pinging @elastic/integrations-services (Team:Services)
